Abstract

We develop the information-theoretical concepts required to study the statistical dependencies 
among three variables. Some of such dependencies are pure triple interactions, in the sense that 
they cannot be explained in terms of a combination of pairwise correlations. We derive bounds 
for triple dependencies, and characterize the shape of the joint probability distribution of three 
binary variables with high triple interaction. The analysis also allows us to quantify the amount 
of redundancy in the mutual information between pairs of variables, and to assess whether the 
information between two variables is or is not mediated by a third variable. These concepts are 
applied to the analysis of written texts. We find that the probability that a given word is found 
in a particular location within the text is not only modulated by the presence or absence of other
nearby words, but also by the presence or absence of nearby pairs of words. We identify the words
enclosing the key semantic concepts of the text, the triplets of words with high pairwise and triple 
interactions, and the words that mediate the pairwise interactions between other words. 

PACS numbers: 89.75.Fb, 02.50.Cw, 02.50.Sk, 89.70.-a

I. INTRODUCTION


Imagine a game where, as you read through a piece of text, you occasionally come across
a blank space representing a removed or occluded word. Your task is to guess the missing
word. This is an example sentence, ____ your guess. If you were able to replace the blank
space in the previous sentence with “make”, or “try”, or some other related word, you have
understood the rules of the game. The task is called the Cloze test [1] and is routinely
administered to evaluate language proficiency, or expertise in a given subject.

The cues available to the player to solve the task can be divided into two major groups. 
First, surrounding words restrict the grammatical function of the missing word, since, for 
example, a conjugated verb cannot usually take the place of a noun, nor vice versa. Second, 
and assuming that the grammatical function of the word has already been surmised, semantic 
information provided by the surrounding words is typically helpful. That is, the presence 
or absence of specific words in the neighborhood of the blank space affects the probability
of each candidate missing word. For example, if the word bee is near the blank space, the 
likelihood of honey is larger than when bee is absent. 

In this paper we study the structure of the probabilistic links between words due to 
semantic connections. In particular, we aim at deciding whether binary interactions between 
words suffice to describe the structure of dependencies, or whether triple and higher-order 
interactions are also relevant: Should we only care for the presence or absence of specific 
words in the vicinity of the blank space, or does the presence or absence of specific pairs 
(or higher-order combinations) also matter in our ability to guess the missing word? For 
example, one would expect that the presence of the word cell would increase the probability 
of words such as cytoplasm, phone or prisoner. The word wax, in turn, is easily associated
with ear, candle or Tussaud. However, the conjoint presence of cell and wax points much
more specifically to concepts such as bee or honey, and diminishes the probability of words
associated with other meanings of cell and wax. Combinations of words, therefore, also
matter in the creation of meaning and context. The question is how relevant this effect is,
and whether the effect of the pair (cell + wax) is more, equal or less than the sum of the two
individual contributions (effect of cell + effect of wax). Here we develop the mathematical 
methods to estimate these contributions quantitatively. 

The problem can be framed in more general terms. In any complex system, the statistical
dependence between individual units cannot always be reduced to a superposition of pairwise
interactions. Triplet, or even higher-order dependencies may arise either because three 
or more variables are dynamically linked together, or because some hidden variables, not 
accessible to measurement, are linked to the visible variables through pairwise interactions. 


In 2006, Schneidman and coworkers [2] demonstrated that, in the vertebrate retina, up
to pairwise correlations between neurons could account for approximately 90% of all the 
statistical dependencies in the joint probability distribution of the whole population. This 
finding brought relief to the scientific community, since an expansion up to the second order 
was regarded sufficient to provide an adequate description of the correlation structure of the 
full system. As a consequence, not much effort has been dedicated to the detection and the 
characterization of third or higher-order interactions. To our knowledge, the present work 
constitutes the first example offering an exact description of third-order dependencies. We 
derive the relevant information-theoretical measures, and then apply them to actual data. 

As a model system, we work with the vast collection of words found in written language, 
since this system is likely to embody complex statistical dependencies between individual 
words. The dependencies arise from the syntactic and semantic structures required to map 
a network of interwoven thoughts into an ordered sequence of symbols, namely, words. The 
projection from the high-dimensional space of ideas onto the single dimension represented 
by time can only be made because language encodes meaning in word order, and word 
relations. In particular, if specific words appear close to each other, they are likely to 
construct a context, or a topic. The context is important in disambiguating among the 
several meanings that words usually have. Therefore, language constitutes a model system 
where individual units (words) can be expected to exhibit high-order interactions. 

Statistics and information theory have proved to be useful in understanding language
structures. Since Zipf’s empirical law [3] on the frequency of words, and the pioneering
work of Shannon [4] measuring the entropy of printed English, a whole branch of science
has followed these lines [5-7]. In recent years, the discipline gained momentum with the
availability of large data sources in the internet [8-11].

In this paper we quantify the amount of double and triple interactions between words
of a given text. In addition, by means of a careful analysis of the structure of pairwise
interactions we distinguish between pairs of variables that interact directly, and pairs of
variables that are only correlated because they both interact with a third variable. With
these goals in mind, we define and measure dependencies between words using concepts from
information theory [12, 13], and apply them in later sections to the analysis of written texts.


II. STATISTICAL DEPENDENCIES AMONG THREE VARIABLES 


When it comes to quantifying the amount of statistical dependence between two variables
$X_1$ and $X_2$ with joint probabilities $p(x_1, x_2)$ and marginal probabilities $p(x_1)$ and
$p(x_2)$, Shannon’s mutual information

$$I(X_1; X_2) = \sum_{x_1, x_2} p(x_1, x_2) \, \log \frac{p(x_1, x_2)}{p(x_1) \, p(x_2)} \tag{1}$$

stands out for its generality and its simplicity. Throughout this paper we take all logarithms
in base 2, and therefore measure all information-theoretical quantities in bits. In Fig. 1,
pairwise statistical dependencies are represented by the rods connecting two variables
(independent variables appear disconnected). Since $I(X_1; X_2)$ is the Kullback-Leibler divergence
$D[p(x_1, x_2) : p(x_1) p(x_2)]$ between the joint distribution $p(x_1, x_2)$ and its independent
approximation $p(x_1) p(x_2)$, the mutual information is always non-negative. Moreover, $X_1$
and $X_2$ are independent if and only if their mutual information vanishes.
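As a minimal illustration of Eq. (1), the following Python sketch evaluates the mutual information of a joint probability table. The dictionary layout and the name `mutual_information` are illustrative assumptions and are not taken from the original analysis.

```python
from math import log2

def mutual_information(joint):
    """Mutual information in bits of a joint table {(x1, x2): probability}."""
    p1, p2 = {}, {}
    for (x1, x2), p in joint.items():
        p1[x1] = p1.get(x1, 0.0) + p   # marginal of the first variable
        p2[x2] = p2.get(x2, 0.0) + p   # marginal of the second variable
    return sum(p * log2(p / (p1[x1] * p2[x2]))
               for (x1, x2), p in joint.items() if p > 0)

# Two perfectly correlated binary variables, and two independent ones.
copies = {(1, 1): 0.5, (-1, -1): 0.5}
independent = {(a, b): 0.25 for a in (1, -1) for b in (1, -1)}

print(mutual_information(copies))       # 1 bit
print(mutual_information(independent))  # 0 bits
```

The two test tables are the extreme cases for binary variables: full dependence gives 1 bit, independence gives 0 bits.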

Three variables, in turn, may interact in different ways; Fig. 1 illustrates all the
possibilities. In this section, we discuss several quantities that measure the strength of the
different interactions. So far, no general consensus has been reached regarding the way in which
statistical dependencies between three variables should be quantified [15-24]. One attempt
in the framework of Information Theory is the symmetric quantity $I(X_1; X_2; X_3)$, sometimes
called the co-information [20], defined as

$$\begin{aligned}
I(X_1; X_2; X_3) &= I(X_1; X_2) - I(X_1; X_2 | X_3) \\
&= I(X_2; X_3) - I(X_2; X_3 | X_1) \\
&= I(X_3; X_1) - I(X_3; X_1 | X_2),
\end{aligned} \tag{2}$$

where $I(X_i; X_j | X_k)$ is the conditional mutual information,

$$I(X_i; X_j | X_k) = \sum_{x_i, x_j, x_k} p(x_i, x_j, x_k) \, \log \frac{p(x_i, x_j | x_k)}{p(x_i | x_k) \, p(x_j | x_k)}. \tag{3}$$

FIG. 1. Different ways in which three variables may interact. A: The three variables are independent.
B: Only pairwise interactions exist. These may involve 1, 2 or 3 links (from left to right). C: The three
variables are connected by a single triple interaction. D: Double and triple interactions may coexist. The
most general case is illustrated in the bottom-right panel.

The co-information measures the way one of the variables (no matter which) influences
the transmission of information between the other two. Positive or negative values of the
co-information have often been associated with redundancy or synergy between the three
variables, though one should be careful to distinguish between several possible meanings of
the words synergy and redundancy (see below, and also [25, 26]).
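Expanding the mutual informations of Eq. (2) in terms of entropies gives the alternating sum $H_1 + H_2 + H_3 - H_{12} - H_{13} - H_{23} + H_{123}$, which makes the symmetry of the co-information explicit. The sketch below uses this form; the helper names and the two example tables are illustrative assumptions.

```python
from itertools import product
from math import log2

def entropy(joint, keep):
    """Entropy in bits of the marginal over the variable indices in `keep`."""
    marg = {}
    for x, p in joint.items():
        key = tuple(x[i] for i in keep)
        marg[key] = marg.get(key, 0.0) + p
    return -sum(p * log2(p) for p in marg.values() if p > 0)

def co_information(joint):
    """Co-information of three variables as an alternating sum of entropies."""
    h = lambda *keep: entropy(joint, keep)
    return h(0) + h(1) + h(2) - h(0, 1) - h(0, 2) - h(1, 2) + h(0, 1, 2)

# Three identical fair bits (a redundant triplet).
copies = {(1, 1, 1): 0.5, (-1, -1, -1): 0.5}
# Four equiprobable states with x1*x2*x3 = -1 (a synergistic triplet).
xor = {x: 0.25 for x in product((1, -1), repeat=3) if x[0] * x[1] * x[2] == -1}

print(co_information(copies))  # +1 bit
print(co_information(xor))     # -1 bit
```

The sign of the result matches the usual reading of the co-information: positive for the redundant example, negative for the synergistic one.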


In an attempt to provide a systematic expansion of the different interaction orders, Amari
[19] developed an alternative way of measuring triple and higher-order interactions. His
approach unifies concepts from categorical data analysis and maximum entropy techniques.
The theory is based on a decomposition of the joint probability distribution as a product
of functions, each factor accounting for the interactions of a specific order. The first term
embodies the independent approximation, the second term adds all pairwise interactions,
and subsequent terms account, in order, for triplets, quadruplets and so forth. This approach
constitutes the starting point for the present work.

Given the random variables $X_1, \ldots, X_N$ governed by a joint probability distribution
$p(x_1, \ldots, x_N)$, all the marginal distributions of order $k$ can be calculated by summing the
values of the joint distribution over $N - k$ of the variables. Since there are $N!/[k!(N - k)!]$
ways of choosing $N - k$ variables among the original $N$, the number of marginal distributions
of order $k$ is $N!/[k!(N - k)!]$. Amari defined the probability distribution $p^{(k)}(x_1, \ldots, x_N)$
as the one with maximum entropy $H^{(k)}_{\max}$ among all those that are compatible with all the
marginal distributions of order $k$. The maximization of the entropy under such constraints has a
unique solution: the distribution allowing variables to vary with maximal freedom,
inasmuch as they still obey the restrictions imposed by the marginals. Hence, $p^{(k)}(x_1, \ldots, x_N)$
contains all the statistical dependencies among groups of $k$ variables that were present in
the original distribution, but none of the dependencies involving more than $k$ variables.

The interactions of order $k$ are quantified by the decrease of entropy from $p^{(k-1)}$ to $p^{(k)}$,
which can be expressed as a Kullback-Leibler divergence

$$D^{(k)} = D[p^{(k)} : p^{(k-1)}] = H^{(k-1)}_{\max} - H^{(k)}_{\max}, \tag{4}$$

where $H^{(k)}_{\max}$ is the entropy of $p^{(k)}$. The last equality of Eq. (4) derives from the generalized
Pythagoras theorem [19]. As increasing constraints cannot increase the entropy, $D^{(k)}$ is
always non-negative.

The total amount of interactions within a group of $N$ variables, the so-called
multi-information $\Delta(X_1, \ldots, X_N)$ [16], is defined as the Kullback-Leibler divergence between the
actual joint probability distribution and the distribution corresponding to the independent
approximation. The multi-information naturally splits into the sum of the different interaction
orders

$$\Delta_{12 \ldots N} = D[p(x_1, \ldots, x_N) : p(x_1) \cdots p(x_N)] = \sum_{k=2}^{N} D^{(k)}. \tag{5}$$


For two variables, there are at most pairwise interactions. Their strength, measured by
$D^{(2)}_{12}$, coincides with Shannon’s mutual information

$$\begin{aligned}
D^{(2)}_{12} &= D[p^{(2)}(x_1, x_2) : p^{(1)}(x_1, x_2)] \\
&= D[p(x_1, x_2) : p(x_1) p(x_2)] \\
&= I(X_1; X_2),
\end{aligned} \tag{6}$$

since the distribution with maximum entropy that is compatible with the two univariate
marginals is $p^{(1)}(x_1, x_2) = p(x_1) p(x_2)$. This result is easily obtained by searching for the
joint distribution that maximizes the entropy using Lagrange multipliers for the constraints
given by the marginals.

When studying three variables, $X_1$, $X_2$ and $X_3$, we separately quantify the amount of
pairwise and of triple interactions. In this context, $D^{(3)}_{123}$ measures the amount of statistical
dependency that cannot be explained by pairwise interactions, and is defined as

$$D^{(3)}_{123} = D[p(x_1, x_2, x_3) : p^{(2)}(x_1, x_2, x_3)] = H^{(2)}_{\max} - H_{123}, \tag{7}$$

where $H_{123}$ represents the full entropy of the triplet $H(X_1, X_2, X_3)$ calculated with
$p(x_1, x_2, x_3)$.

The distribution $p^{(2)}(x_1, x_2, x_3)$ contains up to pairwise interactions. If the actual
distribution $p(x_1, x_2, x_3)$ coincides with $p^{(2)}(x_1, x_2, x_3)$, there are no third-order interactions.
Within Amari’s framework, hence, if $D^{(3)}_{123} > 0$, some of the statistical dependency among
triplets cannot be explained in terms of pairwise interactions.
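The pairwise maximum-entropy distribution $p^{(2)}$ generally has no closed form, so it has to be obtained numerically. The sketch below uses iterative proportional fitting, which is one standard choice; the text does not state which numerical procedure was actually used, and the function names, the number of sweeps and the example bias values are assumptions made for illustration.

```python
from itertools import product
from math import log2

STATES = list(product((1, -1), repeat=3))   # the 8 states of three binary variables
PAIRS = [(0, 1), (0, 2), (1, 2)]            # index pairs whose marginals are constrained

def pair_marginal(p, pair):
    """Bivariate marginal of a joint table {state: probability}."""
    i, j = pair
    m = {}
    for x, px in p.items():
        m[(x[i], x[j])] = m.get((x[i], x[j]), 0.0) + px
    return m

def pairwise_maxent(p, sweeps=200):
    """Iterative proportional fitting, started from the uniform distribution."""
    q = {x: 1.0 / len(STATES) for x in STATES}
    targets = {pair: pair_marginal(p, pair) for pair in PAIRS}
    for _ in range(sweeps):
        for pair in PAIRS:
            i, j = pair
            current = pair_marginal(q, pair)
            for x in STATES:
                key = (x[i], x[j])
                if current[key] > 0:
                    # Rescale so that q reproduces the target marginal of this pair.
                    q[x] *= targets[pair].get(key, 0.0) / current[key]
    return q

def triple_interaction(p):
    """Kullback-Leibler divergence (bits) between p and its pairwise model."""
    q = pairwise_maxent(p)
    return sum(px * log2(px / q[x]) for x, px in p.items() if px > 0)

# Example 1: independent variables with hypothetical biases p(x_i = 1).
bias = (0.7, 0.4, 0.2)
independent = {}
for x in STATES:
    prob = 1.0
    for xi, b in zip(x, bias):
        prob *= b if xi == 1 else 1.0 - b
    independent[x] = prob

# Example 2: four equiprobable states with x1*x2*x3 = -1.
parity = {x: 0.25 for x in STATES if x[0] * x[1] * x[2] == -1}

print(abs(triple_interaction(independent)) < 1e-9)   # no triple interaction
print(abs(triple_interaction(parity) - 1.0) < 1e-9)  # one bit of triple interaction
```

Starting the fit from the uniform distribution matters: the fixed point reached from that starting point is the maximum-entropy distribution among those sharing the three bivariate marginals.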


Both $I(X_1; X_2; X_3)$ and $D^{(3)}_{123}$ are generalizations of the mutual information intended to
describe the interactions between three variables, and both of them can be extended to an
arbitrary number of variables [29]. It is important to notice, however, that the two
quantities have different meanings. A vanishing co-information ($I(X_1; X_2; X_3) = 0$) implies
that the mutual information between two of the variables remains unaffected if the value
of the third variable is changed. However, this does not mean that it suffices to measure
only pairs of variables (and thereby obtain the marginals $p(x_1, x_2)$, $p(x_2, x_3)$, $p(x_3, x_1)$) to
reconstruct the full probability distribution $p(x_1, x_2, x_3)$. Conversely, a vanishing triple
interaction ($D^{(3)}_{123} = 0$) ensures that pairwise measurements suffice to reconstruct the full joint
distribution. Yet, the value of any of the variables may still affect how much information is
transmitted between the other two.

We shall later need to specify the groups of variables whose marginals are used as
constraints. We therefore introduce a new notation for the maximum entropy probability
distributions and for the maximum entropies. Let $V$ represent a set of $k$ variables. For
example, if $k = 3$, we may have $V = \{X_1, X_2, X_3\}$. When studying the dependencies of
$k$-th order, we shall be working with all sets $V_1, \ldots, V_r$ that can be formed with $k$ variables,
where $r = N!/[k!(N - k)!]$. Let $p_{V_1, V_2, \ldots, V_r}$ be the probability distribution of maximum entropy
$H_{V_1, V_2, \ldots, V_r}$ that satisfies the marginal restrictions of $V_1, V_2, \ldots, V_r$. Under this notation,

$$p^{(2)}(x_1, x_2, x_3) = p_{12,13,23}, \qquad p^{(1)}(x_1, x_2, x_3) = p_{1,2,3}. \tag{8}$$

Respectively, the maximum entropies are $H_{12,13,23}$ and $H_{1,2,3} = H(X_1) + H(X_2) + H(X_3)$.
Under the present notation, the mutual information $I(X_i; X_j)$ is $I_{ij}$, and the co-information
of three variables $X_1, X_2, X_3$ is written as $I_{123}$.

The amount of pairwise interactions $D^{(2)}_{ij}$ between variables $i$ and $j$ is known to be
bounded by [14]

$$D^{(2)}_{ij} = I_{ij} \le \min\{H_i, H_j\}. \tag{9}$$

We have derived an analogous bound for triple interactions (see the Appendix). The resulting
inequality links the amount of triple interactions $D^{(3)}_{123}$ with the co-information $I_{123}$,

$$D^{(3)}_{123} \le \min\{I_{12}, I_{23}, I_{31}\} - I_{123} \le \min\{H_1, H_2, H_3\}. \tag{10}$$

These bounds imply that pure triple interactions, appearing in the absence of pairwise
interactions (see Fig. 1C), may only exist if the co-information $I_{123}$ is negative.
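The last statement follows from the first inequality of Eq. (10) in one step. The worked case below assumes that "absence of pairwise interactions" means that the three mutual informations vanish, and uses only quantities defined above.

```latex
% Special case of Eq. (10): I_{12} = I_{23} = I_{31} = 0
D^{(3)}_{123} \;\le\; \min\{I_{12}, I_{23}, I_{31}\} - I_{123} \;=\; -\,I_{123}
\qquad\Longrightarrow\qquad
D^{(3)}_{123} > 0 \ \text{ requires } \ I_{123} < 0 .
```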


A. Characterization of the joint probability distribution of variables with high 
triple interactions 

Two binary variables $X_1$ and $X_2$ can have maximal mutual information $I_{12} = 1$ bit in two
different situations. For the sake of concreteness, assume that $X_i = \pm 1$. Maximal mutual
information is obtained either when $X_1 = X_2$ or when $X_1 = -X_2$. In other words, the joint
probability distribution must either vanish when the two variables are equal, or when the two
variables are different, as illustrated in Fig. 2A. If the mutual information is high, though
perhaps not maximal, then the two variables must still remain somewhat correlated, or
anti-correlated. The joint probability distribution, hence, must drop for those states where the
variables are equal, or for those where they are different. In this section we develop an
equivalent intuitive picture of the joint probability distribution of triplets with maximal
(or, less ambitiously, just high) triple interaction.

FIG. 2. A: Density plot of the two bivariate probability distributions that have $I_{12} = 1$ bit. Dark states have
zero probability, and white states have $p(x_1, x_2) = 1/2$. B: Density plot of the two trivariate probability
distributions with $D^{(3)}_{123} = 1$ bit. Dark states have zero probability, and white states have
$p(x_1, x_2, x_3) = 1/4$. C: Gradual change between a uniform distribution and a XOR distribution, for
different values of $\theta$ (Eq. (13)). D: Amount of triple interactions as a function of the parameter $\theta$.


Consider three binary variables $X_1, X_2, X_3$ taking values $\pm 1$ with joint probability
distribution

$$p(x_1, x_2, x_3) = \begin{cases} 1/4 & \text{if } x_1 x_2 x_3 = -1 \\ 0 & \text{if } x_1 x_2 x_3 = 1, \end{cases} \tag{11}$$

as illustrated in Fig. 2B, left side. For this probability distribution, the three univariate
marginals $p_1, p_2, p_3$ are uniform, that is, $p_i(1) = p_i(-1) = 1/2$. Moreover, the three bivariate
marginals $p_{12}, p_{23}, p_{31}$ are also uniform: $p_{ij}(1, 1) = p_{ij}(1, -1) = p_{ij}(-1, 1) = p_{ij}(-1, -1) = 1/4$.
The full distribution, however, is far from uniform, since only half of the 8 possible
states have non-vanishing probability.

The probability distribution of Eq. (11) is henceforth called a XOR distribution. The
name is inspired by the fact that two independent binary variables $X_1$ and $X_2$ can be
combined into a third dependent variable $X_3 = X_1$ XOR $X_2$, where XOR represents the logical
function exclusive-OR. If the two input variables have equal probabilities for the two states
$\pm 1$, then Eq. (11) describes the joint probability distribution of the triplet $(X_1, X_2, X_3)$.

The maximum-entropy probability compatible with uniform bivariate marginals is
uniform, $p^{(2)}(x_1, x_2, x_3) = 1/8$. The amount of triple interactions is therefore

$$D^{(3)}_{123} = H_{12,13,23} - H_{123} = 3 \text{ bits} - 2 \text{ bits} = 1 \text{ bit}, \tag{12}$$

and $D^{(3)}_{123} = \Delta_{123}$, i.e. all interactions are tripletwise and $D^{(3)}_{123}$ reaches the maximum
value allowed for binary variables. Of course, the same amount of triple interactions is
obtained for the complementary probability distribution (a so-called negative-XOR), for
which $p(x_1, x_2, x_3) = 1/4$ when $x_1 x_2 x_3 = +1$ (see Fig. 2B, right side).
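The properties claimed for the distribution with probability 1/4 on the states $x_1 x_2 x_3 = -1$ can be checked directly. The sketch below verifies that every univariate and bivariate marginal is uniform and evaluates the entropy difference of Eq. (12), taking from the text the fact that the pairwise maximum-entropy model is then uniform. The variable and function names are illustrative assumptions.

```python
from itertools import product
from math import log2

STATES = list(product((1, -1), repeat=3))
# Probability 1/4 on states with x1*x2*x3 = -1, zero on the others.
xor = {x: (0.25 if x[0] * x[1] * x[2] == -1 else 0.0) for x in STATES}

def marginal(p, keep):
    """Marginal distribution over the variable indices in `keep`."""
    m = {}
    for x, px in p.items():
        key = tuple(x[i] for i in keep)
        m[key] = m.get(key, 0.0) + px
    return m

def entropy(p):
    """Entropy in bits of a probability table."""
    return -sum(px * log2(px) for px in p.values() if px > 0)

univariate_uniform = all(
    abs(v - 0.5) < 1e-12
    for i in range(3) for v in marginal(xor, (i,)).values())
bivariate_uniform = all(
    abs(v - 0.25) < 1e-12
    for pair in ((0, 1), (0, 2), (1, 2)) for v in marginal(xor, pair).values())

h_pairwise_model = log2(len(STATES))   # entropy of the uniform model: 3 bits
h_full = entropy(xor)                  # entropy of the XOR distribution: 2 bits
triple = h_pairwise_model - h_full     # 1 bit

print(univariate_uniform, bivariate_uniform, triple)
```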

So far we have demonstrated that XOR and −XOR distributions contain the maximal
amount of triple interactions. Amari [19] has proved the reciprocal result: If the amount
of triple interactions is maximal, then the distribution is either XOR or −XOR. We now
demonstrate that if the joint distribution lies somewhere in between a uniform distribution
and a XOR (or a −XOR) distribution, then the amount of triple interactions lies somewhere
in between 0 and 1, and the correspondence is monotonic. To this end, we consider a family of
joint probability distributions parametrized by a constant $\theta$, defined as a linear combination
of a uniform distribution $p_u(x_1, x_2, x_3) = 1/8$ and a ±XOR distribution,

$$p_\theta(x_1, x_2, x_3) = \frac{1}{8} \left( 1 + x_1 x_2 x_3 \tanh\theta \right), \tag{13}$$

where $\theta \in (-\infty, +\infty)$. Varying $\theta$ from zero to $\infty$ shifts $p_\theta(x_1, x_2, x_3)$ from the uniform
distribution $p_u$ to the XOR probability of Eq. (11) (see Fig. 2C). Negative $\theta$ values, in turn,
shift the distribution to −XOR. All the bivariate marginals of the distribution $p_\theta(x_i, x_j)$ are
uniform, and equal to 1/4. The maximum-entropy model compatible with these marginals
is the uniform distribution $p_u(x_1, x_2, x_3) = 1/8$. Hence, the amount of triple interactions is

$$D^{(3)}_{123}(\theta) = \frac{1}{2} \left[ (1 + \tanh\theta) \log(1 + \tanh\theta) + (1 - \tanh\theta) \log(1 - \tanh\theta) \right]. \tag{14}$$

As shown in Fig. 2D, this function is even, and varies monotonically in each of the intervals
$(-\infty, 0)$ and $(0, +\infty)$. Therefore, there is a one-to-one correspondence between the similarity
to the ±XOR distribution and the amount of triple interactions. The same result is
obtained for arbitrary binary distributions, as argued in the last paragraph of Appendix B.
As a consequence, we conclude that for binary variables, the ±XOR distribution is not just
one possible example distribution with triple interactions, but rather, it is the only way in
which three binary variables interact in a tripletwise manner. If bivariate marginals are kept
fixed, and triple interactions are varied, then the joint probability distribution either gains
or loses a XOR-like component, as illustrated in Fig. 2.
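The closed form of Eq. (14) can be compared with a direct evaluation of the Kullback-Leibler divergence between the family of Eq. (13) and the uniform distribution. The following sketch does so for a few values of the parameter; the function names and the chosen test values are illustrative assumptions.

```python
from itertools import product
from math import log2, tanh

STATES = list(product((1, -1), repeat=3))

def p_theta(theta):
    """Family of Eq. (13): uniform distribution plus a parity term."""
    return {x: (1 + x[0] * x[1] * x[2] * tanh(theta)) / 8 for x in STATES}

def triple_direct(theta):
    """Divergence (bits) between p_theta and the uniform distribution 1/8."""
    return sum(p * log2(8 * p) for p in p_theta(theta).values() if p > 0)

def triple_closed(theta):
    """Closed form of Eq. (14), with logarithms in base 2."""
    t = tanh(theta)
    return 0.5 * sum(u * log2(u) for u in (1 + t, 1 - t) if u > 0)

for theta in (-2.0, -0.5, 0.0, 0.7, 3.0):
    print(theta, triple_direct(theta), triple_closed(theta))
```

Both evaluations agree, give zero at $\theta = 0$, approach 1 bit for large $|\theta|$, and are symmetric under $\theta \to -\theta$.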


III. TRIPLET ANALYSIS OF PAIRWISE INTERACTIONS 

In a triplet of variables $X_1, X_2, X_3$, three possible binary interactions can exist, quantified
by $I(X_1; X_2)$, $I(X_2; X_3)$ and $I(X_3; X_1)$. In this section we characterize the amount of
overlap between these quantities, we bound their magnitude, and we learn how to distinguish
between reducible and irreducible interactions.


A. Redundancy among the three mutual informations within a triplet 

In the previous section, we saw that when there are only two variables $X_1$ and $X_2$,
$D^{(2)}_{12}$ coincides with the mutual information $I(X_1; X_2)$. When there are more than two
variables, $D^{(2)}$ can no longer be equated to a mutual information, since there are several
mutual informations in play, one per pair of variables: $I(X_1; X_2)$, $I(X_2; X_3)$, etc. In this
section, we derive a relation between all these quantities for the case of three interacting
variables. The multi-information of Eq. (5) decomposes into pairwise and triple interactions,

$$\Delta_{123} = D^{(2)}_{123} + D^{(3)}_{123}, \tag{15}$$

from where we arrive at

$$D^{(2)}_{123} = \Delta_{123} - D^{(3)}_{123} = I_{12} + I_{13} + I_{23} - I_{123} - D^{(3)}_{123}. \tag{16}$$


The total amount of pairwise dependencies, hence, is in general different from the sum of
the three mutual informations. That is, depending on the sign of $D^{(3)}_{123} + I_{123}$, the amount
of pairwise interactions $D^{(2)}_{123}$ can be larger or smaller than $\sum I_{ij}$. This range of possibilities
suggests that $\sum I_{ij} - D^{(2)}_{123}$ may be a useful measure of the amount of redundancy or synergy
within the pairwise interactions inside the triplet, and this is the measure that we adopt in
the present paper.

This measure coincides with the co-information when there are no triple dependencies,
that is, when $D^{(3)}_{123} = 0$. In this case,

$$I_{123} = I_{12} + I_{13} + I_{23} - D^{(2)}_{123}. \tag{17}$$


Under these circumstances, a positive value of $I_{123}$ implies that the sum of the three mutual
informations is larger than the total amount of pairwise interactions. The content of the
three informations, hence, must somehow overlap. This observation supports the idea that
a positive co-information is associated with redundancy among the variables. In turn, a
negative value of $I_{123}$ implies that although the maximum entropy distribution compatible
with the pairwise marginals is not equal to $p_1 p_2 p_3$ (that is, although $D^{(2)}_{123} > 0$), when taken
two at a time, variables do look independent (that is, $p_{ij} \approx p_i p_j$). The statistical dependency
between the variables of any pair, hence, only becomes evident when fixing the third variable.
This behavior supports the idea that a negative co-information is associated with synergy
among the variables.
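A simple redundant case illustrates the relation between the sum of the three mutual informations, the pairwise interactions and the co-information. The example below (three identical fair bits) is an illustration chosen here and does not come from the original text; it has no triple interaction because the bivariate marginals already force the three variables to be equal.

```latex
% Three identical fair bits: X_1 = X_2 = X_3 = \pm 1 with probability 1/2 each
I_{12} = I_{13} = I_{23} = 1 \text{ bit}, \qquad
\Delta_{123} = H_1 + H_2 + H_3 - H_{123} = 3 - 1 = 2 \text{ bits}, \qquad
D^{(3)}_{123} = 0
\\[4pt]
D^{(2)}_{123} = \Delta_{123} - D^{(3)}_{123} = 2 \text{ bits}, \qquad
I_{123} = I_{12} + I_{13} + I_{23} - D^{(2)}_{123} = 3 - 2 = 1 \text{ bit} > 0 .
```

The three mutual informations add up to 3 bits, one bit more than the total amount of pairwise interactions, and the excess equals the positive co-information.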

Of course when $D^{(3)}_{123} > 0$, the co-information is no longer so simply related to concepts of
synergy and redundancy, not at least, if the latter are understood as the difference between
the sum of the three informations and $D^{(2)}_{123}$. However, below we show that in actual data,
one can often find a close connection between the amount of triple interactions and the
co-information.

B. Triangular binary interactions


In a group of interacting variables, if $X_1$ has some degree of statistical dependence with
$X_2$, and $X_2$ has some statistical dependence with $X_3$, one could expect $X_1$ and $X_3$ to show
some kind of statistical interaction, only due to the chained dependencies $X_1 \leftrightarrow X_2 \leftrightarrow X_3$,
even in the absence of a direct connection. Here we demonstrate that indeed, two strong
chained interactions necessarily imply the presence of a third connection closing the triangle.
In the pictorial representation of the middle column of Fig. 1, this means that if only two
connections exist (there is no link closing the triangle), then the two present interactions
cannot be strong. For example, with binary variables, it is not possible to have $I_{12} = I_{23} = 1$
bit, and $I_{31} = 0$. The general inequality reads (see the derivation in Appendix A)

$$I_{12} + I_{13} - H_1 \le I_{23}. \tag{18}$$
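The triangular inequality, read here as $I_{12} + I_{13} - H_1 \le I_{23}$, can be probed numerically on randomly drawn distributions of three binary variables. The sketch below is only a sanity check, not a proof; the sampling scheme, the seed and the helper names are assumptions.

```python
import random
from itertools import product
from math import log2

STATES = list(product((1, -1), repeat=3))

def entropy(p, keep):
    """Entropy in bits of the marginal over the variable indices in `keep`."""
    m = {}
    for x, px in p.items():
        key = tuple(x[i] for i in keep)
        m[key] = m.get(key, 0.0) + px
    return -sum(v * log2(v) for v in m.values() if v > 0)

def mutual_information(p, i, j):
    return entropy(p, (i,)) + entropy(p, (j,)) - entropy(p, (i, j))

def triangle_gap(p):
    """I_23 - (I_12 + I_13 - H_1); the inequality says this is never negative."""
    return (mutual_information(p, 1, 2)
            - mutual_information(p, 0, 1)
            - mutual_information(p, 0, 2)
            + entropy(p, (0,)))

rng = random.Random(0)
gaps = []
for _ in range(1000):
    weights = [rng.random() for _ in STATES]
    total = sum(weights)
    gaps.append(triangle_gap({x: w / total for x, w in zip(STATES, weights)}))

# Three identical fair bits saturate the bound.
copies = {(1, 1, 1): 0.5, (-1, -1, -1): 0.5}

print(min(gaps) >= -1e-9, abs(triangle_gap(copies)) < 1e-12)
```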


C. Identification of pairwise interactions that are mediated through a third variable

In the previous section we demonstrated that the chained dependencies $X_1 \leftrightarrow X_2 \leftrightarrow X_3$
can induce some statistical dependency between $X_1$ and $X_3$. On the other hand, it is also
possible for $X_1$ and $X_3$ to interact directly, inheriting their interdependence from no other
variable. These two possible scenarios cannot be disambiguated by just measuring the
mutual information between pairs of variables. In Appendix C we explain how, starting from
the most general model (illustrated in the lower-right panel of Fig. 1), the analysis of triple
interactions allows us to identify those links that can be explained from binary
interactions involving other variables, and those that cannot: the so-called irreducible interactions.
Briefly stated, we need to evaluate whether the interaction between $X_1$ and $X_2$ (captured
by the bivariate marginal $p_{12}$) and the interaction between $X_2$ and $X_3$ (captured by $p_{23}$)
suffice to explain all pairwise interactions within the triplet, including also the interaction
between $X_1$ and $X_3$. To that end, we compute a measure of the discrepancy between the
two corresponding maximum entropy models,

$$D_{12,23} = D[p_{12,13,23} : p_{12,23}] = H_{12,23} - H_{12,13,23}. \tag{19}$$

The amount of irreducible interaction, that is, the amount of binary interaction between $X_1$
and $X_3$ that remains unexplained through the chain $X_1 \leftrightarrow X_2 \leftrightarrow X_3$, is defined as

$$\Delta^{13} = \min\{I_{13}, D_{12,23}\}. \tag{20}$$

In Sect. IV D we search for pairs of variables with small irreducible interaction, by computing
$\Delta^{13}$ using all possible candidate variables $X_2$ that may act as mediators. From them, we
keep the one giving minimal irreducible interaction, that is, the one for which the chain
$X_1 \leftrightarrow X_2 \leftrightarrow X_3$ provides the best explanation for the interaction between $X_1$ and $X_3$.
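The discrepancy between the two maximum-entropy models described above (one constrained by all three bivariate marginals, the other only by the marginals of the chain through the candidate mediator) can be sketched as follows. The model with all three marginals is fitted by iterative proportional fitting, which is an assumed numerical choice, while the chain model has the known closed form $p_{12} \, p_{23} / p_2$. In the code, indices 0, 1, 2 stand for $X_1, X_2, X_3$; all names and the noise levels of the example are hypothetical.

```python
from itertools import product
from math import log2

STATES = list(product((1, -1), repeat=3))
PAIRS = [(0, 1), (0, 2), (1, 2)]

def marginal(p, keep):
    m = {}
    for x, px in p.items():
        key = tuple(x[i] for i in keep)
        m[key] = m.get(key, 0.0) + px
    return m

def pairwise_maxent(p, sweeps=200):
    """Maximum-entropy model with all three bivariate marginals (fitted by IPF)."""
    q = {x: 1.0 / len(STATES) for x in STATES}
    targets = {pair: marginal(p, pair) for pair in PAIRS}
    for _ in range(sweeps):
        for pair in PAIRS:
            current = marginal(q, pair)
            for x in STATES:
                key = (x[pair[0]], x[pair[1]])
                if current[key] > 0:
                    q[x] *= targets[pair].get(key, 0.0) / current[key]
    return q

def chain_model(p):
    """Maximum-entropy model given only p12 and p23: p12 * p23 / p2."""
    p12, p23, p2 = marginal(p, (0, 1)), marginal(p, (1, 2)), marginal(p, (1,))
    q = {}
    for x in STATES:
        denom = p2.get((x[1],), 0.0)
        num = p12.get((x[0], x[1]), 0.0) * p23.get((x[1], x[2]), 0.0)
        q[x] = num / denom if denom > 0 else 0.0
    return q

def chain_discrepancy(p):
    """Divergence (bits) between the full pairwise model and the chain model."""
    full, chain = pairwise_maxent(p), chain_model(p)
    return sum(f * log2(f / chain[x]) for x, f in full.items() if f > 0)

# Mediated case: X2 is a fair bit, X1 and X3 are independent noisy copies of X2.
mediated = {}
for x1, x2, x3 in STATES:
    mediated[(x1, x2, x3)] = (0.5 * (0.8 if x1 == x2 else 0.2)
                                  * (0.7 if x3 == x2 else 0.3))

# Direct case: X1 = X3, while X2 is an independent fair bit.
direct = {x: 0.25 for x in STATES if x[0] == x[2]}

print(chain_discrepancy(mediated))  # close to 0: the chain explains everything
print(chain_discrepancy(direct))    # close to 1 bit: the link X1-X3 is not mediated
```

A small discrepancy indicates that the candidate mediator accounts for the dependence between the two outer variables; a large one indicates a link that the chain cannot explain.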


IV. MARGINALIZATION AND HIDDEN VARIABLES 

Imagine we have a system of $N$ variables that are linked through just pairwise interactions.
In such a system, for any pair of variables $X_i, X_j$ there is a third variable $X_k$ producing
a vanishing irreducible interaction $\Delta^{ij} = 0$. By selecting a subset of $k$ variables, we may
calculate the $k$-th order marginal $p_k$, by marginalizing over the remaining $N - k$ variables. As
opposed to the original multivariate distribution $p_N$, the marginal $p_k$ may well contain triple
and higher-order interactions. In other words, there may be pairs of variables $X_i, X_j$ that
belong to the subset for which there is no other third variable $X_k$ in the subset producing
a vanishing irreducible interaction $\Delta^{ij} = 0$. The high-order interactions in the subset,
therefore, result from the fact that not all interacting variables are included in the analysis.
Therefore, triple and higher-order statistical dependencies do not necessarily arise due to
irreducible triple and higher-order interactions: Just pairwise interactions may suffice to
induce them, whenever we marginalize over one or more of the interacting variables. An
example of this effect is derived in Appendix D. In the same way, marginalization may
introduce spurious pairwise interactions between variables that do not interact directly, as
illustrated in Fig. 3. Therefore, even if, by construction, we happen to know that the system
under study can only contain pairwise statistical dependencies, it may be important to
compute triple and higher-order interactions, whenever one or a few of the relevant variables
are not measured.
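The pairwise effect of marginalization can be reproduced with a toy model: a hidden binary variable drives two others, which copy it with some noise and do not interact with each other. The flip probability and all names below are hypothetical choices made for this sketch.

```python
from itertools import product
from math import log2

FLIP = 0.1  # hypothetical probability that a driven variable differs from the driver

# Joint table over (z, x1, x2): z is the hidden driver, x1 and x2 are noisy copies.
joint = {}
for z, x1, x2 in product((1, -1), repeat=3):
    p1 = 1 - FLIP if x1 == z else FLIP
    p2 = 1 - FLIP if x2 == z else FLIP
    joint[(z, x1, x2)] = 0.5 * p1 * p2

def entropy(p, keep):
    """Entropy in bits of the marginal over the variable indices in `keep`."""
    m = {}
    for x, px in p.items():
        key = tuple(x[i] for i in keep)
        m[key] = m.get(key, 0.0) + px
    return -sum(v * log2(v) for v in m.values() if v > 0)

# Given the driver, the two driven variables carry no information about each other.
conditional_mi = (entropy(joint, (0, 1)) + entropy(joint, (0, 2))
                  - entropy(joint, (0,)) - entropy(joint, (0, 1, 2)))

# Once the driver is marginalized out, a pairwise dependence appears.
marginal_mi = entropy(joint, (1,)) + entropy(joint, (2,)) - entropy(joint, (1, 2))

print(conditional_mi)  # essentially zero
print(marginal_mi)     # roughly 0.32 bits
```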

Virtually all scientific studies focus their analysis on only a subset of all the variables that
truly interact in the real system. However, as stated above, neglecting some of the variables
typically induces high-order correlations among the remaining variables. If such correlations
are interpreted within the reduced framework of the variables under study, they are spurious,
at least, in the sense that there may well be no mechanistic interaction among the selected
variables that gives rise to such high-order interactions. However, if interpreted in a broader
sense (i.e., a mathematical fact, that may result as a consequence of marginalization),
high-order correlations may be viewed as a footprint of the marginalized variables, which are
often inaccessible. As such, they constitute an opportunity to characterize those parts of
the system that cannot be described by the values of the recorded variables.

FIG. 3. Examples illustrating the effects of marginalization in a pair of variables (A) or a triplet (B). In each
case, the variable represented in black drives the other slave variables, which do not interact directly with
each other (top). However, after marginalizing over the driving variable, a statistical dependence between
the remaining variables appears. The new interaction can be pairwise (A), or pairwise and tripletwise (B).

Below we analyze the statistics of written language. We select a group of words (each 
selected word defines one variable), and we measure the presence or absence of each of 
these words in different parts of the book. For simplicity, not all the words in the book 
are included in the analysis, so the discarded words constitute examples of marginalized 
variables. However, marginalized variables are not always as concrete as non-analyzed words. 
Other non-registered factors may also influence the presence or absence of specific words, 
for example, those related to the thematic topic or the style that the author intended for 
each part of the book. These aspects are latent variables that we do not have access to 
by simply counting words. An analysis of the high-order statistics among the subgroup 
of selected words may therefore be useful to characterize such latent variables, which are 
otherwise inaccessible through automated text analysis. 

As an ansatz, we can imagine that each topic affects the statistics of a subgroup of all
the words. The fact that topics are not included in the analysis is equivalent to having
marginalized over topics. By doing so, we create interactions within the different subgroups
of words. If the topics do not overlap too much, from the network of the resulting interactions
we may be able to identify communities of highly connected words that are related
to certain topics. Variations in the topic can therefore be diagnosed from variations in the
high-order statistics.


V. OCCURRENCE OF WORDS IN A BOOK 


Before analyzing a book, all its words are converted to lowercase, and spaces and punctuation
marks are discarded. Each word is replaced by its base uninflected form using the WordData
function of the program Mathematica® [30]. In this way, for instance, a word and its
plural are considered to be the same word, and verb conjugations are unified as well.

In order to construct the network of interactions between words, we analyze the probability
that different words appear near each other. The notion of neighborhood is introduced
by segmenting each book into parts. A book containing $M$ words is divided into $P$ parts, so
that there are $M/P$ words per part. We analyze the statistics of a subgroup of $K$ selected
words $w_1, \dots, w_K$, and define the variables

$$X_i = \begin{cases} +1 & \text{if the word } w_i \text{ appears in a part} \\ -1 & \text{otherwise.} \end{cases} \qquad (21)$$

The different parts of the book constitute the different samples of the joint probability
$p(x_1, x_2, \dots, x_K)$, or of the corresponding marginals. Notice that if word $w_i$ is found in a
given part of the book, in that sample $X_i = 1$, no matter whether the word appeared once
or many times. The marginal probability $p(x_i = 1) = (\langle x_i \rangle + 1)/2$ is the average frequency with
which word $w_i$ appears in one (any) of the parts. Here, we analyze up to triple dependencies,
so we work with joint distributions of at most three variables, $p(x_i, x_j, x_k)$.
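As a rough illustration of Eq. (21), the following Python sketch builds the presence/absence samples from a tokenized text. The function names `presence_matrix` and `marginals`, the toy sentence, and the decision to drop leftover tokens when $M$ is not a multiple of $P$ are assumptions made for this example; they are not taken from the original analysis.

```python
def presence_matrix(tokens, selected_words, n_parts):
    """Return a list of n_parts rows; entry [part][j] is +1 if word j occurs in that part, else -1."""
    part_len = len(tokens) // n_parts              # M/P words per part (leftover tokens are dropped)
    index = {w: j for j, w in enumerate(selected_words)}
    x = [[-1] * len(selected_words) for _ in range(n_parts)]
    for pos in range(part_len * n_parts):
        j = index.get(tokens[pos])
        if j is not None:
            x[pos // part_len][j] = 1              # repeated occurrences leave the value at +1
    return x


def marginals(x):
    """p(x_i = 1) = (<x_i> + 1) / 2, averaged over parts."""
    n_parts = len(x)
    return [(sum(row[j] for row in x) / n_parts + 1) / 2 for j in range(len(x[0]))]


# Toy text (hypothetical), already lowercased and stripped of punctuation.
tokens = "the bee builds a cell of wax and the bee fills the cell".split()
x = presence_matrix(tokens, ["bee", "cell", "wax"], n_parts=3)
p = marginals(x)
```

Each row of `x` is one sample of $(x_1, \dots, x_K)$, so the empirical joint distributions used in the rest of the section can be obtained by counting rows.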

In the present work, we choose to study words that have an intermediate range of frequencies.
We disregard the most frequent words (which are generally stop words such as
articles, pronouns and so on) because they predominantly play a grammatical role, and
only to a lesser extent do they influence the semantic context [31]. We also discard the very
infrequent words (those appearing only a few times in the whole book), because their rarity
induces statistical inaccuracies due to limited sampling [32]. Discarding words implies that
only a seemingly small number of words are analyzed, allowing us to illustrate the fact that
even a small number of variables suffices to infer important aspects of the structure of the
network of statistical dependencies among words. In other types of data, the limitation in
the number of variables may arise from unavoidable technical constraints, and not from a
matter of choice.

We analyzed two books, On the Origin of Species (OS) by Charles Darwin and The
Analysis of Mind (AM) by Bertrand Russell, both taken from the Project Gutenberg website
[33]. Each book was divided into $P = 512$ parts. In OS, each part contained 295 words, and
in AM, 175. Parts should be big enough so that we can still see the structure of semantic
interactions, and yet the number of parts should not be so small as to induce inaccuracies
due to limited sampling.

In both books, we analyzed $K = 400$ words with intermediate frequencies. For OS,
the analyzed words appeared a total number of times $n_i$, with $33 < n_i < 112$. For AM,
we analyzed words with $21 < n_i < 136$. Since for these words the number of samples
(parts) is much greater than the number of states (2), entropies were calculated with the
maximum likelihood estimator. We are able to detect differences in entropy of 0.01 bits,
with a significance of $\alpha = 0.1\%$ (see Appendix E for an analysis of significance). A Bayesian
analysis of the estimation error due to finite sampling was also included, allowing us to
bound errors between 0.005 bits and 0.01 bits, depending on the size of the interaction (see
Appendix F).


A. Statistics of single words 

Before studying interactions between two or more words, we characterize the statistical
properties of single words. Specifically, we analyze the frequency of individual words, and
the predictability of their presence in one (any) part of the book. Within the framework of
Information Theory, the natural measure of (un)predictability is entropy.

Using the notation $p_i = p(x_i = 1)$, the entropy $H_i$ is

$$H_i = -(1 - p_i) \log_2 (1 - p_i) - p_i \log_2 p_i. \qquad (22)$$

This quantity is maximal ($H_i = 1$ bit) when $p_i = 1/2$, that is, when the word $w_i$ appears
in half of the parts. When $w_i$ appears in either most of the parts or in almost none, $H_i$
approaches zero. For all the analyzed words, $0 < p_i < 1/2$. In this range, the entropy $H_i$ is
a monotonic function of $p_i$.

The value of $p_i$, however, is not univocally determined by the number $n_i$ of times that
the word $w_i$ appears in the book. If $w_i$ appears at most once per part, then $p_i = n_i/P$. If
$w_i$ tends to appear several times per part, then $p_i < n_i/P$.

In addition, one can determine whether the fraction of parts containing the word is in
accordance with the expected fraction given the total number of times $n_i$ the word appears
in the whole book. If $n_i$ is half the number of parts (that is, $n_i = P/2$), then $p_i = 1/2$ implies
that the $n_i$ occurrences are distributed as uniformly as they possibly can: half of the parts do not
contain the word, and the other half contain it just once. If, instead, $n_i = 100P$, a value of
$p_i = 1/2$ corresponds to a highly non-uniform distribution: the word is absent from half of
the parts, but it appears many times in the remaining half.

To formalize these ideas, we compared the entropy of each selected word with the entropy
that would be expected for a word with the same probability per part $1/P$, but randomly
distributed throughout the book and sampled $n_i$ times. The binomial probability of finding
the word $k$ times in one (any) part is

$$p_i(k) = \binom{n_i}{k} \left(\frac{1}{P}\right)^{k} \left(1 - \frac{1}{P}\right)^{n_i - k}. \qquad (23)$$

Equation (23) describes an integer variable. In order to compare with Eq. (22), we define $Y_i$
as the binary variable measuring the presence/absence of word $w_i$ in one (any) part, assuming
that the word is binomially distributed. That is, $Y_i = 0$ if $k = 0$, and $Y_i = 1$ if $k > 0$. The
marginal probability $p(Y_i = 1)$ is $p(k > 0) = 1 - (1 - 1/P)^{n_i}$. This formula is also obtained
when all the words in the book are shuffled. In this case $p_i(k)$ follows a hypergeometric
distribution, such that $p_i(k = 0) = \binom{M - n_i}{M/P} / \binom{M}{M/P} \approx (1 - 1/P)^{n_i}$, where
the last equality holds when $M \gg n_i$.
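The agreement between the shuffled (hypergeometric) model and the binomial expression can be checked numerically. The sketch below uses the OS segmentation quoted in the text ($P = 512$ parts of 295 words) and an illustrative count $n_i = 50$; the function names are hypothetical, and the hypergeometric term is written as the probability that a part of $M/P$ words contains none of the $n_i$ occurrences.

```python
from fractions import Fraction
from math import comb


def p_absent_shuffled(M, P, n):
    """Exact probability that a part of M/P words contains none of the n occurrences."""
    part_len = M // P
    return Fraction(comb(M - n, part_len), comb(M, part_len))


def p_absent_binomial(P, n):
    """Binomial approximation (1 - 1/P)^n."""
    return (1 - 1 / P) ** n


P, part_len, n = 512, 295, 50      # P and part length as in OS; n is an illustrative count
M = P * part_len
exact = float(p_absent_shuffled(M, P, n))
approx = p_absent_binomial(P, n)
print(exact, approx)
```

For word counts much smaller than the length of the book, the two values are expected to differ only far beyond the precision relevant for the entropy estimates.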

Hence, the entropy of the binary variable associated with the binomial (or the shuffled)
model is

$$H_i^{\rm binomial}(Y_i) = -(1 - 1/P)^{n_i} \log_2\left[(1 - 1/P)^{n_i}\right] - \left[1 - (1 - 1/P)^{n_i}\right] \log_2\left[1 - (1 - 1/P)^{n_i}\right]. \qquad (24)$$

The entropy of the variable $X_i$ measured from each book is compared with the entropy of
the binomial-derived variable $Y_i$ in Fig. 4.

FIG. 4. Entropy of the 400 selected words in each book (one data point per word), compared to the
expected entropy for a binomial variable with the same total count $n_i$ (continuous line), as a function of the
total count. Entropies are calculated with the maximum likelihood estimator. The analytical expression of
Eq. (24) is represented with the black line, and the gray area corresponds to the percentiles 1%-99% of the
dispersion expected in the binomial model, when using a sample of $n_i$ words. Data points outside the gray
area, hence, are highly unlikely under the binomial hypothesis, even when allowing for inaccuracies due to
limited sampling. A: OS. B: AM.
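Equations (22) and (24) are straightforward to evaluate. The following sketch compares the entropy of a word that is concentrated in few parts with the binomial expectation for the same total count. The count $n_i = 100$ and the number of occupied parts (40) are hypothetical values chosen only for illustration; $P = 512$ is the segmentation used in the text.

```python
from math import log2


def binary_entropy(p):
    """Eq. (22): entropy in bits of a binary variable with p(x = 1) = p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(1 - p) * log2(1 - p) - p * log2(p)


def binomial_entropy(n, P):
    """Eq. (24): entropy of presence/absence when n occurrences fall at random in P parts."""
    q_absent = (1 - 1 / P) ** n
    return binary_entropy(1 - q_absent)


P = 512
n = 100                   # hypothetical total count of the word
occupied_parts = 40       # hypothetical number of parts that contain the word

h_measured = binary_entropy(occupied_parts / P)   # maximum likelihood estimate
h_binomial = binomial_entropy(n, P)
delta_h = h_binomial - h_measured                 # positive when the word is clumped
print(h_measured, h_binomial, delta_h)
```

A positive `delta_h` corresponds to a word that occupies fewer parts than a randomly scattered word with the same count would.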

Even if the process were truly binomial, the estimation of the entropy may still fluctuate,
due to limited sampling. In Fig. 4, the gray region represents the area expected for 98% of
the samples under the binomial hypothesis. We expect 1% of the words to fall above this
region, and another 1% below. However, in OS, out of 400 words, none appears
above, and 15% appear below. In AM, the percentages are 0% and 16.5%. In both cases,
the outliers with small entropy are roughly 15 times more numerous than predicted by the binomial
model, and no outliers with high entropy were found, although 4 were expected for each book.
In both books, hence, individual word entropies were significantly smaller than predicted by
the binomial approximation, implying that words are not distributed randomly: in any given
part, each word tends to appear many times, or not at all.
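A band such as the gray region of Fig. 4 can be approximated by simulation. The sketch below is one possible way to do it, not necessarily the procedure used for the figure: it assumes that each of the $n_i$ occurrences falls independently and uniformly in one of the $P$ parts, estimates the entropy of the resulting presence/absence variable, and reads off the 1% and 99% percentiles. The function names, the number of trials and the seed are arbitrary choices.

```python
import random
from math import log2


def binary_entropy(p):
    """Entropy in bits of a binary variable with p(x = 1) = p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(1 - p) * log2(1 - p) - p * log2(p)


def simulated_entropies(n, P, trials, seed=0):
    """Sorted entropy estimates for n occurrences scattered at random over P parts."""
    rng = random.Random(seed)
    values = []
    for _ in range(trials):
        occupied = {rng.randrange(P) for _ in range(n)}   # parts that received the word
        values.append(binary_entropy(len(occupied) / P))
    return sorted(values)


def percentile_band(sorted_values, lower=0.01, upper=0.99):
    """Lower and upper percentiles of an already sorted list."""
    last = len(sorted_values) - 1
    return sorted_values[int(lower * last)], sorted_values[int(upper * last)]


P, n = 512, 100                      # P as in the text; n is an illustrative count
low, high = percentile_band(simulated_entropies(n, P, trials=2000))
analytic = binary_entropy(1 - (1 - 1 / P) ** n)
print(low, analytic, high)
```

A word whose measured entropy falls below `low` would count as an outlier with small entropy in the sense used above.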

A list of the words with highest difference $(H_i^{\rm binomial} - H_i)$ is shown in Table I. Interestingly,
most of these words are nouns, with the first exception appearing in place 10 (the adjective
“rudimentary”) for OS. As reported previously [31], words with relevant semantic content
are the ones that tend to be most unevenly distributed.


B. Statistics of pairs of words 

In principle, there are two possible scenarios in which the mutual information between
two variables can be high: (a) in each part of the book the two words either appear together
or are both absent, and (b) the presence of one of the words in a given part excludes the
presence of the other. In Table II we list the pairs of words with highest mutual information.
In all these cases, the two words in the pair tend to be either simultaneously present or
simultaneously absent (option (a) above).
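The two scenarios can be made concrete with a small computation. The sketch below estimates the mutual information between two presence/absence sequences with the maximum likelihood (plug-in) estimator; the toy sequences and the function names are illustrative only. Both perfect co-occurrence and perfect mutual exclusion give the maximal value, which is why the sign of the dependency has to be read from the joint distribution and not from the mutual information itself.

```python
from collections import Counter
from math import log2


def entropy(samples):
    """Plug-in entropy (bits) of a sequence of hashable samples."""
    n = len(samples)
    return -sum(c / n * log2(c / n) for c in Counter(samples).values())


def mutual_information(x, y):
    """I(X;Y) = H(X) + H(Y) - H(X,Y), estimated from paired samples."""
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))


x = [1, 1, -1, -1]                       # word i present in the first two parts
together = [1, 1, -1, -1]                # scenario (a): word j appears with word i
exclusive = [-1, -1, 1, 1]               # scenario (b): word j appears only without word i
unrelated = [1, -1, 1, -1]               # word j independent of word i

mi_together = mutual_information(x, together)
mi_exclusive = mutual_information(x, exclusive)
mi_unrelated = mutual_information(x, unrelated)
```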

The words listed in Table II are semantically related. In both books, there are examples of
words that participate in two pairs: cell is connected to both bee and wax (OS) and mnemic
is connected to both phenomena and causation (AM). These examples keep appearing if the
lists of Table II are extended further down. Their structure corresponds to the double links
in the second and third columns of Figs. 1B and 1D. As explained in Sect. III B, two strong
binary links imply that the third link closing the triangle should also be present. Indeed,
in OS, america is linked to both south and north (rows 2 and 4 of Table II). The words
south and north are also linked to each other, but they only appear in position 32, with a
mutual information that is approximately 1/3 of the two principal links. A similar situation
is seen with bee and wax, both connected to cell, although the direct connection between
bee and wax appears sooner, in position 16. The same happens in AM with phenomena and
causation, linked through mnemic, which are connected to each other in the 39th place of
the list. These examples raise the question of whether the weakest link in the triangle could be
entirely explained as a consequence of the two stronger links. A triplet analysis of pairwise
interactions allows us to assess whether such is indeed the case (see Sect. III C).


TABLE I. Words with highest difference in entropy $\Delta H_i = H_i^{\rm binomial} - H_i$, expressed in bits. Left:
OS. Right: AM.

Word (OS)      $\Delta H_i$    Word (AM)        $\Delta H_i$
bee            0.369           proposition      0.335
cell           0.365           appearance       0.315
slave          0.302           box              0.299
stripe         0.295           datum            0.258
pollen         0.275           animal           0.240
sterility      0.266           objective        0.215
pigeon         0.252           star             0.211
fertility      0.248           content          0.206
nest           0.242           emotion          0.205
rudimentary    0.234           consciousness    0.204

We finish the pairwise analysis with a graphical representation of the words that are
most strongly linked with pairwise connections (left panels of Fig. 5). Words belonging to
a common topic are displayed in different grey levels (different colors, online), and tend to
form clusters. In each cluster (insets in Fig. 5), triplets of words often form triangles of
pairwise interactions. In the central plot, and in the top graph of each inset, the width of
each link is proportional to the mutual information between the two connected words.


TABLE II. Pairs of words with highest mutual information. Left: OS. Right: AM. The values are
in bits.

$w_i$ (OS)      $w_j$ (OS)    $I_{ij}$   $H_i$    $H_j$    $w_i$ (AM)       $w_j$ (AM)    $I_{ij}$   $H_i$    $H_j$
male            female        0.242      0.504    0.409    1                2             0.191      0.330    0.337
south           america       0.210      0.480    0.560    truth            falsehood     0.110      0.429    0.191
reproductive    system        0.152      0.290    0.474    response         accuracy      0.107      0.306    0.264
north           america       0.133      0.429    0.560    depend           upon          0.107      0.229    0.616
cell            wax           0.122      0.201    0.150    mnemic           phenomena     0.095      0.423    0.516
bee             cell          0.120      0.330    0.201    mnemic           causation     0.090      0.423    0.381
fertile         sterile       0.120      0.345    0.330    consciousness    conscious     0.089      0.504    0.352
deposit         bed           0.109      0.322    0.314    door             window        0.086      0.160    0.128
fertility       sterility     0.109      0.352    0.322    stimulus         response      0.085      0.474    0.306
southern        northern      0.107      0.306    0.264    pain             pleasure      0.079      0.171    0.181

FIG. 5. Central graph: Network of pairwise interactions in OS. Width of links proportional to the mutual 
information between the two connected words. Insets: Detail of selected subnetworks. Top graph: links 
proportional to mutual information. Bottom graph: links proportional to irreducible interaction. 

C. Statistics of triplets 

In order to determine whether triple interactions provide a relevant contribution to the
overall dependencies between words, we compare $D^{(3)}_{ijk}$ with the total amount of pairwise
interactions within the triplet, $D^{(2)}_{ijk}$.

FIG. 6. Fraction of the total interaction within a triplet $\Delta_{ijk}$ that corresponds to tripletwise dependencies,
$D^{(3)}_{ijk}/\Delta_{ijk}$, as a function of the total interaction. The grey level of each data point is proportional to the
(logarithm of the) number of triplets at that location (scale bars on the right). $\Delta_{ijk}$ values above 0.01 bits
are significant (see Appendix). A: OS. B: AM. Dashed line: averages over all triplets with the same $\Delta_{ijk}$.

Figure 6 shows the fraction of the total interaction that corresponds to triple dependencies,
$D^{(3)}_{ijk}/\Delta_{ijk}$, as a function of the total interaction $\Delta_{ijk}$. The data extend further to the
right, but the triplets with $\Delta_{ijk} > 0.05$ bits are less than 0.4%. The first thing to notice is
that the values of the total interaction (values on the horizontal axis) are approximately an
order of magnitude smaller than the entropies of individual words (see Fig. 4). Individual
entropies range between 0.1 and 0.9 bits, and interactions lie between 0 and 0.05 bits. In order
to get an intuition of the meaning of such a difference, we notice that if we want to know
whether words $w_i$, $w_j$ and $w_k$ appear in a given part, the number of binary questions that
we need to ask is (depending on the three chosen words) between 0.3 and 2.7 if we assume
the words are independent ($H_i + H_j + H_k$), and between 0.25 and 2.2 if we make use of
their mutual dependencies ($H_i + H_j + H_k - \Delta_{ijk}$). Although sparing $\approx 10\%$ of the questions
may seem a meager gain, it can certainly make a difference when processing large amounts
of data.

The second thing to notice is that triple interactions are by no means small compared
to the total interactions within the triplet, since there are triplets with $D^{(3)}_{ijk}/\Delta_{ijk}$ of order
unity. In other words, triple interactions are not negligible when compared to pairwise
interactions. In the triplets with $D^{(3)}_{ijk}/\Delta_{ijk} \approx 1$, the departure from the independent
assumption resembles the XOR behavior (or $-$XOR), in the sense that the states $(x_1, x_2, x_3)$
for which $\prod_i x_i = 1$ have a lower (higher) probability than the states with $\prod_i x_i = -1$. The
first case corresponds to triplets where all pairs of words tend to appear together, but the
three of them are rarely seen together. In the second case, the words tend to appear either
the three together or each one on its own, but they are rarely seen in pairs.
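The extreme XOR case can be worked out explicitly. In the sketch below (an idealized example, not data from the books), all the probability is spread uniformly over the four states with $\prod_i x_i = -1$. Every bivariate marginal is then uniform, so the maximum-entropy distribution compatible with the pairwise marginals is the uniform one, and the triple information reduces to $H_1 + H_2 + H_3 - H_{123}$. The co-information is computed as the alternating sum of entropies; this sign convention is an assumption of the example.

```python
from itertools import product
from math import log2


def entropy(p):
    """Entropy in bits of a distribution given as {state: probability}."""
    return -sum(q * log2(q) for q in p.values() if q > 0)


def marginal(p, keep):
    """Marginal distribution over the variables whose positions are listed in keep."""
    out = {}
    for state, q in p.items():
        key = tuple(state[a] for a in keep)
        out[key] = out.get(key, 0.0) + q
    return out


PAIRS = ((0, 1), (1, 2), (0, 2))
states = list(product((-1, 1), repeat=3))
# Idealized XOR-like triplet: only the states with x1 * x2 * x3 = -1 are allowed.
xor = {s: (0.25 if s[0] * s[1] * s[2] == -1 else 0.0) for s in states}

h1 = [entropy(marginal(xor, (a,))) for a in range(3)]
h2 = [entropy(marginal(xor, pair)) for pair in PAIRS]
h3 = entropy(xor)

pairwise_mi = [h1[a] + h1[b] - h for (a, b), h in zip(PAIRS, h2)]
co_information = sum(h1) - sum(h2) + h3
triple_information = sum(h1) - h3      # valid here because the pairwise max-ent model is uniform
```

In this idealized case all the dependency is tripletwise: the three pairwise mutual informations vanish, while the triple information and the co-information have equal magnitude and opposite sign.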

Table III shows the words with largest triple information. These interactions are well
above the significance threshold of 0.01 bits. The triplet (america, south, north) is similar
to an XOR gate, so these words tend to appear in pairs but not all three together. In certain
contexts the author uses the combination south america, in other contexts, north america,
and yet in others, he discusses topics that require both south and north but no america.

TABLE III. Words with highest triple information $D^{(3)}_{ijk}$. The first column displays a tag that allows
us to identify each triplet in Fig. 7. The last column indicates whether the triplet behaves as an
XOR gate (+1) or a $-$XOR ($-1$). Top: OS. Bottom: AM. Values in bits.

Tag          i          j             k              $D^{(3)}_{ijk}$   $I_{ijk}$   $D^{(3)}_{ijk}/\Delta_{ijk}$   XOR
$\alpha$     america    south         north          0.065             0.005       0.16                           +1
$\beta$      inherit    occasional    appearance     0.040             -0.040      0.96                           -1
$\gamma$     action     wide          branch         0.036             -0.036      0.93                           -1
$\delta$     europe     perhaps       chapter        0.036             -0.036      0.90                           -1
$\epsilon$   climate    expect        just           0.035             -0.035      0.97                           -1

$\alpha$     speak      causation     appropriate    0.041             -0.041      0.93                           -1
$\beta$      sense      perception    natural        0.033             -0.033      0.90                           -1
$\gamma$     since      actual        wholly         0.033             -0.033      0.90                           -1
$\delta$     wish       me            connection     0.033             -0.033      0.95                           -1
$\epsilon$   consist    should        life           0.033             -0.033      0.92                           -1

FIG. 7. Triple information $D^{(3)}_{ijk}$ as a function of the co-information $I_{ijk}$ for all triplets. The grey level of
each data point is proportional to the (logarithm of the) number of triplets at that location (scale bars on
the right). Values above 0.01 bits are significant (see Appendix). A: OS. B: AM.

Most of the triplets in Table III have triple information values that are equal in magnitude
to the co-information but with opposite sign, that is, $D^{(3)}_{ijk} \approx -I_{ijk}$. Besides, for these triplets,
most of the interaction is tripletwise, that is, $D^{(3)}_{ijk}/\Delta_{ijk} \approx 1$. To determine whether such
tendency is preserved throughout the population, in Fig. 7 we plot the triple information
$D^{(3)}_{ijk}$ as a function of the co-information $I_{ijk}$ for all triplets. We see that the vast majority of
triplets are located along the diagonal $D^{(3)}_{ijk} \approx -I_{ijk}$. In order to understand why this is so,
we analyze how data points are distributed when picking a triplet of words randomly. The
cases A, B, C and D of Fig. 1 are ordered in decreasing probability. That is, picking three
unrelated words (Fig. 1A) has higher probability than picking a triplet with only pairwise
interactions (B), which is still more likely than picking a case with only triple interactions
(C), leaving the case of double and triple interactions (D) as the least probable. All cases
with no triple interaction (A and B) fall on the horizontal axis $D^{(3)}_{ijk} = 0$ in Fig. 7. Therefore,
in order to understand why points outside the horizontal axis cluster along the diagonal, we
must analyze the triplets that do have a triple interaction (panels C and D in Fig. 1). We
begin with case C, because it has a higher probability than case D. This case corresponds to
$D^{(3)}_{ijk} > 0$ and $I_{ij} = I_{jk} = I_{ki} \approx 0$. It is easy to see that in these circumstances, $p^{(2)} \approx p_i p_j p_k$,
and hence, $D^{(3)}_{ijk} \approx -I_{ijk}$. We continue with the left column of case D, since having a single
pairwise interaction has higher probability than having more. This case corresponds to
$D^{(3)}_{ijk} > 0$, $I_{ij} = I_{jk} \approx 0$ and $I_{ki} > 0$, for some ordering of the indexes $i, j, k$. In these
circumstances, $p^{(2)} \approx p_{ij} p_{ik} p_{jk}/(p_i p_j p_k)$, which again implies that $D^{(3)}_{ijk} \approx -I_{ijk}$. Therefore, all
triplets containing some triple interaction and at most a single pairwise interaction fall along
the diagonal in Fig. 7. The only outliers are triplets with $D^{(3)}_{ijk} > 0$ and at least two links
with pairwise interactions, which, as derived in Sect. III B, most likely contain also the third
pairwise link. Such highly connected triplets are typically few.
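The step described as easy to see for case C can be spelled out as follows. This is a sketch under two assumptions: that the triple information is the entropy difference $D^{(3)}_{ijk} = H[p^{(2)}] - H_{ijk}$, and that the co-information is written as the alternating sum of entropies (equivalently, $I_{ijk} = I_{ij} - I_{ij|k}$).

```latex
% Case C: all pairwise mutual informations vanish, so each bivariate marginal
% factorizes (H_{ij} = H_i + H_j, etc.) and p^{(2)} = p_i p_j p_k.
\begin{align*}
D^{(3)}_{ijk} &= H[p^{(2)}] - H_{ijk} = H_i + H_j + H_k - H_{ijk}, \\
I_{ijk} &= H_i + H_j + H_k - H_{ij} - H_{jk} - H_{ki} + H_{ijk} \\
        &= H_i + H_j + H_k - 2\,(H_i + H_j + H_k) + H_{ijk} \\
        &= -\left(H_i + H_j + H_k - H_{ijk}\right) = -D^{(3)}_{ijk}.
\end{align*}
```

The same bookkeeping with $p^{(2)} \approx p_{ij} p_{ik} p_{jk}/(p_i p_j p_k)$ gives $H[p^{(2)}] - H_{ijk} \approx H_{ij} + H_{jk} + H_{ki} - H_i - H_j - H_k - H_{ijk} = -I_{ijk}$ for the left column of case D.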

From Eq. (16) we see that the triplets that are near the diagonal are neither synergistic nor
redundant, that is, $I_{ij} + I_{jk} + I_{ki} \approx D^{(2)}_{ijk}$. Those located above the diagonal have redundant
pairwise information ($I_{ij} + I_{jk} + I_{ki} > D^{(2)}_{ijk}$), whereas those below are synergistic. In the two
analyzed books, very few ($\sim 10$) triplets satisfy $I_{ij} + I_{jk} + I_{ki} - D^{(2)}_{ijk} < -0.01$ bits. In contrast,
$\sim 300$ triplets have significant redundant pairwise information ($I_{ij} + I_{jk} + I_{ki} - D^{(2)}_{ijk} > 0.01$ bits).
The triplets located far from the diagonal correspond, in both cases, to those with a large
total dependency ($\Delta_{ijk} > 0.1$ bits). Table IV displays the words with highest redundant
pairwise interaction, that is, $I_{ij} + I_{jk} + I_{ki} - D^{(2)}_{ijk}$. With the exception of data point $\alpha$
(america, south, north), the triplets that have highest redundancy tend to be in the lower
right part of Fig. 7, whereas the ones with highest triple interaction lie in the upper left
corner.


TABLE IV. Triplets with highest redundant pairwise information $D^{(3)}_{ijk} + I_{ijk} = I_{ij} + I_{jk} + I_{ki} - D^{(2)}_{ijk}$.
The first column displays a tag that allows us to identify each triplet in Fig. 7. Top: OS. Bottom:
AM. Values in bits.

Tag         i           j            k            $D^{(3)}_{ijk} + I_{ijk}$
$\zeta$     bee         cell         wax          0.089
$\alpha$    america     south        north        0.070
$\eta$      glacial     southern     northern     0.065
$\theta$    mountain    glacial      northern     0.062
$\kappa$    male        female       sexual       0.057

$\zeta$     leave       door         window       0.061
$\eta$      stimulus    response     accuracy     0.039
$\theta$    mnemic      phenomena    causation    0.038
$\kappa$    truth       false        falsehood    0.036
$\lambda$   place       2            1            0.027


D. Identification of irreducible binary interactions 


Using the tools of Sect. III C, here we identify the pairs of words that interact only because
the two of them have strong binary interactions with a third word. First, the
pairs of words whose mutual information is larger than the significance level (0.01 bits) are
selected. For those pairs, the irreducible interaction is calculated by considering all other
candidate intermediary words, and selecting the one that minimizes Eq. (20). We observe
that many pairs have a low irreducible interaction, implying that their dependency can be
understood by a path that goes through a third variable $X_k$, such as


$$X_i \longleftrightarrow X_k \longleftrightarrow X_j. \qquad (25)$$


In these situations, the behavior of the pair $\{X_i, X_j\}$ can be predicted from the dependency
between $\{X_i, X_k\}$ and the dependency between $\{X_k, X_j\}$.
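A toy model helps to see what a mediated interaction looks like. In the sketch below, $X_k$ is a fair binary variable and $X_i$ and $X_j$ are independent noisy copies of it. The fidelity value 0.9 and all function names are hypothetical. As a stand-in for the irreducible interaction, the example uses the conditional mutual information $I_{ij|k}$; this is an assumption made for illustration and is not claimed to coincide with the quantity defined in Eq. (20).

```python
from itertools import product
from math import log2


def entropy(p):
    """Entropy in bits of a distribution given as {state: probability}."""
    return -sum(q * log2(q) for q in p.values() if q > 0)


def marginal(p, keep):
    """Marginal distribution over the variables whose positions are listed in keep."""
    out = {}
    for state, q in p.items():
        key = tuple(state[a] for a in keep)
        out[key] = out.get(key, 0.0) + q
    return out


def chain(fidelity=0.9):
    """Joint p(x_i, x_k, x_j) where x_i and x_j are independent noisy copies of x_k."""
    p = {}
    for xi, xk, xj in product((0, 1), repeat=3):
        a = fidelity if xi == xk else 1 - fidelity
        b = fidelity if xj == xk else 1 - fidelity
        p[(xi, xk, xj)] = 0.5 * a * b
    return p


def h(p, keep):
    return entropy(marginal(p, keep))


p = chain()
i_ij = h(p, (0,)) + h(p, (2,)) - h(p, (0, 2))                            # mutual information of the pair
i_ij_given_k = h(p, (0, 1)) + h(p, (1, 2)) - h(p, (1,)) - h(p, (0, 1, 2))  # what remains once x_k is known
print(i_ij, i_ij_given_k)
```

The pair shows a sizeable mutual information that disappears once the mediator is taken into account, which is the qualitative signature sought in this section.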

In Table V we list the pairs $(i, j)$ of words that have smallest irreducible interaction,
including the third word ($k$) that acts as a mediator. In these triplets, most of the interaction
between words $w_i$ and $w_j$ is explained in terms of $w_k$. Mediators tend to have a high semantic
content, and to provide a context in which the other two words interact. Besides, the triplets
$(i, j, k)$ in Table V tend to cluster in the lower right corner of Fig. 7, implying that pairs of
words share redundant mutual information.

The number of pairs with significant mutual information (i.e., $I_{ij} > 0.01$ bits) whose
interaction is explained at least 90% through a third word (i.e., $\Delta_{ij}/I_{ij} < 0.1$) is higher
in OS (108) than in AM (19). Out of the 108 pairs of OS, 16 are explained
through the word cell, 12 through america, 8 through northern, 6 through glacial, 6 through
sterility, and so on. The fact that specific words tend to mediate the interaction between
many pairs suggests that they may act as hubs in the network.

In the right panels of Fig. 5 we see the network of irreducible interactions. When compared
with the network of mutual informations (left panels), the irreducible network contains
weaker bonds, as expected, since by definition $\Delta_{ij}$ cannot be larger than $I_{ij}$. In the figure,
we can identify some of the pairs of Table V whose interaction is mediated by a third word.
Such pairs appear with a significantly weaker bond in the right panel, as for example bee-wax
(mediator = cell, OS) and stimulus-accuracy (mediator = response, AM). Moreover,
one can also identify the pairs whose interaction is intrinsic (that is, not mediated by a third
word) as those where the link on the right has approximately the same width as on the left.
Notable examples are male-female (OS) and depend-upon (AM).


VI. CONCLUSIONS 

In this paper, we developed the information-theoretical tools to study triple dependencies
between variables, and applied them to the analysis of written texts. Previous studies had
proposed two different measures to quantify the amount of triple dependencies: the
co-information $I_{ijk}$ and the total amount of triple interactions $D^{(3)}$. Given that there is a
certain controversy regarding which of these measures should be used, it is important to
notice that $I_{ijk}$ is a function of three specific variables $X_1$, $X_2$, $X_3$, whereas $D^{(3)}$ is a global
measure of all triple interactions within a wider set of $N$ variables, with $N \geq 3$. Therefore,
it only makes sense to compare the two measures when $D^{(3)}$ is calculated for the same group
of variables as $I_{ijk}$, which implies using $N = 3$.

TABLE V. Pairs of words with lowest irreducible interaction. The first column displays a tag that
allows us to identify each triplet in Fig. 7. Top: OS. Bottom: AM. Values in bits.

Tag         i              j            $I_{ij}$   $\Delta_{ij}$   k (mediator)
$\zeta$     bee            wax          0.093      0.003           cell
$\alpha$    south          north        0.071      0.001           america
$\lambda$   continent      south        0.032      0.001           america
$\mu$       lay            wax          0.032      0.000           cell
$\nu$       southern       arctic       0.031      0.001           northern

$\theta$    phenomena      causation    0.042      0.004           mnemic
$\eta$      stimulus       accuracy     0.039      0.000           response
$\lambda$   place          2            0.028      0.000           1
$\mu$       proposition    falsehood    0.024      0.002           truth
$\nu$       proposition    door         0.022      0.000           window

The two measures have different meanings. Whereas the co-information quantifies the
effect of one (any) variable on the information transmission between the other two, the
amount of triple interactions measures the increase in entropy that results from approximating
the true distribution by the maximum-entropy distribution that only contains
up to pairwise interactions. When studied in full generality, these two quantities need not
be related, that is, by fixing one of them, one cannot predict the value of the other. When
restricting the analysis to binary variables, however, a link between them arises. Three
binary variables are characterized by a probability distribution over $2^3$ possible states. Due
to the normalization restriction, the distribution is determined once the probabilities of 7
states are fixed. Choosing those 7 numbers is equivalent to choosing the three entropies
$H_i$, $H_j$, $H_k$, the three mutual informations $I_{ij}$, $I_{jk}$, $I_{ki}$, and one more parameter. This extra
parameter can be either the co-information $I_{ijk}$ (in which case the triple interaction $D^{(3)}_{ijk}$ is
fixed), or the triple interaction $D^{(3)}_{ijk}$ (in which case the co-information $I_{ijk}$ is fixed). Hence,
although in general the co-information and the amount of triple interactions are not related
to one another, for binary variables, once the single entropies and the pairwise interactions
are determined, $I_{ijk}$ and $D^{(3)}_{ijk}$ become linked. In this particular situation, hence, there is no
controversy between the two quantities, because they both provide the same information,
only with different scales.

Moreover, we have shown that when pooling together all the triplets in the system, and
now without fixing the value of individual entropies or pairwise interactions, $I_{ijk}$ and $D^{(3)}_{ijk}$
often add up to zero. This effect results from the fact that most triplets contain at most a
single pairwise interaction. Hence, for most of the triplets the two measures provide roughly
the same information. The exception involves the triplets containing at least two binary
interactions, which are likely to contain all three interactions, in view of Sect. III B.

One could repeat the whole analysis presented here, but with $X_i$ = number of times the
word appeared in a given part (instead of the binary variable appeared / not appeared).
This choice would transform the binary approach into an integer description, which could
potentially be more accurate, if enough data are available. It should be borne in mind,
however, that the size of the space grows with the cube of the number of states, so serious
undersampling problems are likely to appear in most real applications. We choose here the
binary description to ensure good statistics. In addition, this choice allowed us to (a) relate
triple interactions with the ±XOR gate, and (b) relate the co-information with the amount
of triple interactions.

In the present work we studied interactions between words in written language through
a triple analysis. This approach allowed us to accomplish two goals. First, we detected pure
triple dependencies that would not be detectable by studying pairs of variables. Second, we
determined whether pairwise interactions can be explained through a third word.

We found that on average, 11% and 13% of the total interaction within a group of 
three words is pure tripletwise. On average, triple dependencies are weaker than pairwise 
interactions. However, in 7% and 9% of the total number of triplets, triple interactions are 
larger than pairwise. Although this is a small fraction of all the triplets, all the 400 selected 
words participate in at least one such triplet. Hence, if word interactions are to be used to 
improve the performance in a Cloze test, triple interactions are by no means negligible. 

We believe that, in particular for written language, the presence of triple interactions is
mainly due to the marginalization over the latent topics. For example, the triplet (america,
south, north) resembles an XOR gate, so variables tend to appear two at a time, but not
alone, nor the three together. Imagine we include an extra variable (this time, a non-binary
variable), specifying the geographic location of the phenomena described in each part of the
book. The new variable would take one value in those parts where Darwin describes events
of North America, another value for South America, and yet other values in other parts of
the globe. If these topic-like variables are included in the analysis, the amount of high-order
interactions between words is likely to diminish, because complex word interactions would
be mediated by pairwise interactions between words and topics. However, since topic-like
variables are not easily amenable to automatic analysis, here we have restricted the study
to word-like variables. We conclude that high-order interactions between words are likely to
be the footprint of having marginalized over topic-like variables.


Appendix A: Mathematical proofs


a. Derivation of the bound in Eq. m 


As imposing more restrictions cannot increase the entropy, $H_{12,23,31} \leq H_{12,23}$. Using the
fact that $H_{12,23} = H_{12} + H_{23} - H_2$ (see Appendix B), it follows from Eq. (7) that

$$\begin{aligned} D^{(3)}_{123} &\leq H_{12,23} - H_{123} \\ D^{(3)}_{123} &\leq I_{13|2}. \end{aligned} \qquad (A1)$$

This inequality is tight, since a probability distribution exists for which the equality is
fulfilled: when $H_{12,23} = H_{12,23,31}$, that is, when $p_{12,23,31}(x_1, x_2, x_3) = p_{12}\, p_{23}/p_2$.

The derivation can be done removing any of the restrictions $V \in \{12, 13, 23\}$. Therefore,

$$\begin{aligned} D^{(3)}_{123} &\leq \min\{I_{12|3}, I_{23|1}, I_{13|2}\} \\ D^{(3)}_{123} &\leq \min\{I_{12}, I_{13}, I_{23}\} - I_{123}, \end{aligned} \qquad (A2)$$

where $I_{123}$ is the co-information. From Eq. (A2), it also follows that




(A 3 ) 


b. Derivation of Eq. (22) 

Inserting the upper bound of Eq. (A1) in Eq. (16), 

$$I_{12} + I_{23} + I_{31} = I_{123} + D^{(2)}_{123} + D^{(3)}_{123}$$
$$\le I_{123} + D^{(2)}_{123} + I_{23|1}$$
$$= I_{23} - I_{23|1} + D^{(2)}_{123} + I_{23|1}. \tag{A4}$$

Therefore, 

$$I_{12} + I_{31} \le D^{(2)}_{123}. \tag{A5}$$

In addition, since reducing the number of marginal restrictions cannot diminish the entropy 
of the maximum entropy distribution, 

$$D^{(2)}_{123} = -H[p_{12,23,31}] + H_1 + H_2 + H_3$$
$$\le -H[p_{23}] + H_1 + H_2 + H_3$$
$$= I_{23} + H_1. \tag{A6}$$

Combining Eqs. (A5) and (A6), 

$$I_{12} + I_{31} - H_1 \le I_{23}.$$

Therefore, if $I_{12}$ and $I_{31}$ are large, $I_{23}$ cannot be too small. 
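
The inequality $I_{12} + I_{31} - H_1 \le I_{23}$ can be checked numerically. The following is a minimal Python sketch, assuming NumPy is available; the function names (`entropy`, `mutual_information`, `bound_gap`) and the random seed are illustrative choices, not part of the original analysis. It draws random joint distributions of three binary variables and records the smallest value of $I_{23} - (I_{12} + I_{31} - H_1)$, which should never be negative.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits of an array of probabilities."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def mutual_information(pxy):
    """Mutual information in bits of a two-dimensional joint distribution."""
    return entropy(pxy.sum(axis=1)) + entropy(pxy.sum(axis=0)) - entropy(pxy)

def bound_gap(p):
    """Return I_23 - (I_12 + I_31 - H_1) for a joint array p[x1, x2, x3]."""
    i12 = mutual_information(p.sum(axis=2))
    i23 = mutual_information(p.sum(axis=0))
    i31 = mutual_information(p.sum(axis=1))
    h1 = entropy(p.sum(axis=(1, 2)))
    return i23 - (i12 + i31 - h1)

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
gaps = [bound_gap(rng.dirichlet(np.ones(8)).reshape(2, 2, 2)) for _ in range(1000)]
min_gap = min(gaps)  # expected to be non-negative up to rounding errors
```

The gap closes when the three variables are perfect copies of one another, in which case all mutual informations and $H_1$ coincide.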


Appendix B: Maximum entropy solution 


The problem of finding the probability distribution that maximizes the entropy under 
linear constraints, such as fixing some of the marginals, has a unique solution [27]. Although 
no explicit closed form is known for the case where each variable varies in an arbitrary 
domain, there are procedures, for example the iterative proportional fitting [27], that converge 
to the solution. 
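
As an illustration of iterative proportional fitting, the following minimal Python sketch (assuming NumPy, strictly positive target marginals, and a hypothetical function name `ipf_pairwise`) starts from the uniform distribution of three variables and rescales it in turn so that each bivariate marginal matches its target. The iteration count and tolerance are arbitrary choices.

```python
import numpy as np

def ipf_pairwise(p12, p13, p23, n_iter=500, tol=1e-12):
    """Iterative proportional fitting for three variables.

    p12[x1, x2], p13[x1, x3] and p23[x2, x3] are the target bivariate
    marginals, assumed strictly positive and mutually consistent.
    Returns the fitted joint distribution q[x1, x2, x3].
    """
    shape = (p12.shape[0], p12.shape[1], p13.shape[1])
    q = np.full(shape, 1.0 / np.prod(shape))  # uniform starting point
    for _ in range(n_iter):
        q_old = q.copy()
        q = q * (p12 / q.sum(axis=2))[:, :, None]   # match the (1,2) marginal
        q = q * (p13 / q.sum(axis=1))[:, None, :]   # match the (1,3) marginal
        q = q * (p23 / q.sum(axis=0))[None, :, :]   # match the (2,3) marginal
        if np.abs(q - q_old).max() < tol:
            break
    return q

# Example: fit the pairwise marginals of a random joint distribution.
rng = np.random.default_rng(1)  # arbitrary seed
p = rng.dirichlet(np.full(8, 5.0)).reshape(2, 2, 2)
q = ipf_pairwise(p.sum(axis=2), p.sum(axis=1), p.sum(axis=0))
```

Starting from the uniform distribution is what makes the fixed point the maximum entropy solution; a different starting point would in general converge to a different distribution with the same marginals.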

In some special cases a closed form exists. For example, when the univariate marginals 
are fixed, the solution is the product of such marginals. Another case is when we look for the 
maximum entropy distribution of three variables $p(x_1, x_2, x_3)$ that satisfies two constraints 
(for example $p(x_1, x_2)$ and $p(x_2, x_3)$) out of the three bivariate marginals. Posing the 
maximization problem through Lagrange multipliers, we obtain a solution of the form 

$$p(x_1, x_2, x_3) = f_1(x_1, x_2)\, f_2(x_2, x_3). \tag{B1}$$

If we enforce the marginal constraints and the normalization, we get 

$$p(x_1, x_2, x_3) = \frac{p(x_1, x_2)\, p(x_2, x_3)}{p(x_2)}, \tag{B2}$$

which is known as the pairwise approximation. The entropy of this distribution is 

$$H[p] = H_{12,23} = H_{12} + H_{23} - H_2. \tag{B3}$$
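
The closed form of Eq. (B2) and the entropy of Eq. (B3) can be verified directly. The sketch below is a minimal Python illustration, assuming NumPy; the names `pairwise_approximation`, `h_direct` and `h_formula` are hypothetical. It builds the pairwise approximation from two bivariate marginals and compares its entropy with $H_{12} + H_{23} - H_2$.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits of an array of probabilities."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def pairwise_approximation(p12, p23):
    """Return q[x1, x2, x3] = p12[x1, x2] * p23[x2, x3] / p2[x2]."""
    p2 = p12.sum(axis=0)  # marginal of the shared variable x2
    return p12[:, :, None] * p23[None, :, :] / p2[None, :, None]

rng = np.random.default_rng(2)  # arbitrary seed
p = rng.dirichlet(np.ones(8)).reshape(2, 2, 2)
p12, p23 = p.sum(axis=2), p.sum(axis=0)

q = pairwise_approximation(p12, p23)
h_direct = entropy(q)                                           # entropy of Eq. (B2)
h_formula = entropy(p12) + entropy(p23) - entropy(p12.sum(axis=0))  # Eq. (B3)
```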


Below we derive the solution $p^{(2)}(x_1, x_2, x_3)$ in the special case of three binary variables 
($x_i = \pm 1$). This solution has maximum entropy and satisfies the three second order marginal 
constraints, $p(x_1, x_2)$, $p(x_1, x_3)$ and $p(x_2, x_3)$. In principle, eight variables need to be 
determined, one for the probability of each state. However, considering the normalization 
condition, the constraints on the three univariate marginals, and on the three bivariate marginals, 
we are left with only a single free variable. As shown in previous studies [18], [19], the problem 
reduces to finding the root of a cubic equation. Since we are interested in comparing this 
solution with the joint probability $p(x_1, x_2, x_3)$, a convenient and conceptually enlightening 
way of expressing the solution $p^{(2)}(x_1, x_2, x_3)$, as in the work of Martignon [18], is 

$$p^{(2)}(x_1, x_2, x_3) = p(x_1, x_2, x_3) - \delta \prod_i x_i, \tag{B4}$$

where the value of $\delta$ is such that the probabilities remain in the simplex, that is, 
$p^{(2)} \in [0, 1]$. For the marginals, we get 

$$p^{(2)}(x_i, x_j) = p^{(2)}(x_i, x_j, 1) + p^{(2)}(x_i, x_j, -1)$$
$$= p(x_i, x_j, 1) + p(x_i, x_j, -1) - \delta\, x_i x_j + \delta\, x_i x_j \tag{B5}$$
$$= p(x_i, x_j).$$

The value of $\delta$ is obtained from 

$$\prod_{\mathbf{x} :\, \prod_i x_i = 1} p^{(2)}(x_1, x_2, x_3) = \prod_{\mathbf{x} :\, \prod_i x_i = -1} p^{(2)}(x_1, x_2, x_3), \tag{B6}$$

a condition ensuring that the coefficient accounting for the triple interaction in the log-linear 
model vanishes [19]. Eq. (B6) reduces to the previously mentioned cubic equation on $\delta$. 
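
Instead of solving the cubic equation in closed form, the root of Eq. (B6) can be located numerically. The following minimal Python sketch (assuming NumPy, a strictly positive joint distribution, and a hypothetical function name `solve_delta`) uses bisection rather than the cubic formula: within the interval of $\delta$ values that keep all probabilities positive, the logarithm of the ratio of the two products in Eq. (B6) decreases monotonically with $\delta$, so it has a single zero there.

```python
import numpy as np

# Sign x1*x2*x3 of each state; index 0 stands for x = -1 and index 1 for x = +1.
SPIN = np.array([-1.0, 1.0])
SIGN = SPIN[:, None, None] * SPIN[None, :, None] * SPIN[None, None, :]

def solve_delta(p, n_iter=200):
    """Find delta such that p2 = p - delta * x1*x2*x3 satisfies Eq. (B6).

    p[a, b, c] must be strictly positive. Returns (delta, p2).
    """
    def log_ratio(delta):
        # log of (product over states with sign +1) / (product over sign -1)
        return float((SIGN * np.log(p - delta * SIGN)).sum())

    lo = -p[SIGN < 0].min()  # below this, a state with sign -1 turns negative
    hi = p[SIGN > 0].min()   # above this, a state with sign +1 turns negative
    for _ in range(n_iter):
        mid = 0.5 * (lo + hi)
        if log_ratio(mid) > 0:
            lo = mid
        else:
            hi = mid
    delta = 0.5 * (lo + hi)
    return delta, p - delta * SIGN

# Example: a uniform distribution plus an XOR-like component of size 0.05.
p_example = 1.0 / 8.0 + 0.05 * SIGN
delta_example, p2_example = solve_delta(p_example)
```

In this example the bivariate marginals are uniform, so the maximum entropy model is the uniform distribution and the recovered $\delta$ equals the size of the added component.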


If the solution is $\delta = 0$, then the probability $p$ is the one with maximum entropy. 
Otherwise, the probability $p$ departs from $p^{(2)}$, implying that, up to a certain degree, the 
multivariate distribution resembles either the XOR gate, or its opposite. 

We close this section by discussing the effect of varying the amount of triple interactions 
while keeping all bivariate marginals fixed, as discussed in Sect. III A. There we proved that 
when $p(x_1, x_2, x_3)$ took the shape of Eq. (13), then the amount of triplet interactions was a 
measure of the similarity between the joint distribution and a ±XOR distribution. Here we 
extend this result to arbitrary distributions. We have demonstrated here that $p(x_1, x_2, x_3)$ 
can always be written as $p(x_1, x_2, x_3) \propto p^{(2)}(x_1, x_2, x_3) + \delta\, x_1 x_2 x_3$, where $p^{(2)}(x_1, x_2, x_3)$ 
is the maximum entropy model compatible with the bivariate marginals of the original 
distribution, and $\delta$ is a certain constant. Amari showed that if $\delta = 0$, there are no triple 
interactions. Pushing his argument further, here we notice that if the bivariate marginals 
are kept fixed, the only way of changing the amount of triple interactions is to vary the 
value of $\delta$. The size of $\delta$ determines the degree of similarity between $p(x_1, x_2, x_3)$ and a 
±XOR distribution. Therefore, once the bivariate marginals are fixed, the only parameter 
that can be manipulated in order to change the amount of triple interactions is the one that 
quantifies the size of the ±XOR component. 
Appendix C: Irreducible interactions 


Following the ideas from [22, 34], we wish to detect whether the statistical dependencies 
among a group of variables $V = \{X_1, \ldots, X_k\}$ contain all possible interactions, or whether 
some of the interactions can be derived from others. All possible interactions are defined by 
the power set of $V$, that is, the set whose elements are all the possible subsets of elements of 
$V$. If some interactions can be explained in terms of others, then some groups of variables 
in $V$ are independent from other groups, and the set that defines all present interactions is 
smaller than the power set. To identify the subsets of variables whose dependencies suffice 
to explain all interactions, we propose different structured sets $\Omega = \{U_1, U_2, \ldots, U_\ell\}$, where 
each $U_i = \{X_{i_1}, \ldots, X_{i_k}\}$ is itself a set of variables that may or may not belong to $V$. Each 
set $\Omega$ is a candidate explanation of the statistical structure in $V$. Within the maximum 
entropy approach, for each proposed $\Omega$ we calculate 

$$\Delta^{V}_{\Omega} = D[p_{\Omega \cup V} : p_{\Omega}]$$
$$= H_{\Omega} - H_{\Omega \cup V}, \tag{C1}$$

where we are using the notation described in the previous section, so that $p_{\Omega}$ is the 
maximum entropy distribution compatible with the marginals of the groups of variables 
$U_1, U_2, \ldots, U_\ell$ contained in $\Omega$, and $p_{\Omega \cup V}$ is the maximum entropy distribution compatible 
with the marginals of $U_1, \ldots, U_\ell, V$. If $\Delta^{V}_{\Omega}$ is zero, then $p_{\Omega \cup V} = p_{\Omega}$, and the joint 
probability of the variables $V$ can be derived from $\Omega$. This means that the statistical dependencies 
among the groups that compose $\Omega$ suffice to explain the statistical structure among the 
groups that compose $V$, even if the former contains interactions whose order is smaller than 
the number of elements in $V$. 

In the simplest example, we want to decide whether the statistical structure in the pairwise 
marginal $p_{12} = p(X_1, X_2)$ may or may not be explained by the univariate marginals 
$p_1 = p(X_1)$ and $p_2 = p(X_2)$. In this case, $V = \{X_1, X_2\}$ and $\Omega = \{U_1, U_2\}$, with 
$U_1 = \{X_1\}$, $U_2 = \{X_2\}$. When calculating the union $\Omega \cup V$, we notice that here the sign 
$\cup$ represents a union of marginals, not a union of sets. The bivariate marginal $p_{12}$ contains 
the univariate marginals $p_1$ and $p_2$, so $\Omega \cup V = V$. Hence, 

$$\Delta^{12}_{1,2} = D[p_{12} : p_{1,2}] = I(X_1; X_2). \tag{C2}$$

If $\Delta^{12}_{1,2} = 0$, the entire statistical structure within $V$ is accounted for by the two independent 
variables $X_1$ and $X_2$. 

In a more complex example, we may wish to determine whether the statistical dependencies 
between the variables $X_1$, $X_2$ and $X_3$ can be explained by just first and second order 
interactions. We define $V = \{X_1, X_2, X_3\}$ and $\Omega = \{U_1, U_2, U_3\}$, with $U_1 = \{X_1, X_2\}$, 
$U_2 = \{X_2, X_3\}$, $U_3 = \{X_3, X_1\}$. The triple marginal $p_{123}$ contains all pairwise marginals 
$p_{12}$, $p_{23}$ and $p_{31}$, so again, $\Omega \cup V = V$. Therefore, 

$$\Delta^{123}_{12,13,23} = D[p_{123} : p_{12,13,23}] = D^{(3)}_{123}. \tag{C3}$$

If $\Delta^{123}_{12,13,23} = 0$, pairwise interactions suffice to explain all the statistical structure in $V$. 

A less ambitious goal would be to determine whether the statistical dependence between 
$X_1$ and $X_2$ is mediated by a third variable $X_3$. We hence define $V = \{X_1, X_2\}$, $\Omega = \{U_1, U_2\}$, 
and $U_1 = \{X_1, X_3\}$, $U_2 = \{X_2, X_3\}$. The union of marginals is now $\Omega \cup V = \{V, U_1, U_2\} \ne V$, 
so in this case, $\Delta^{12}_{13,23}$ is given by Eq. (19). 

The set $\Omega$ constitutes a candidate explanatory model for the statistical dependencies 
within $V$. The aim is to find the simplest set $\Omega$ for which $\Delta^{V}_{\Omega} = 0$. The search for such $\Omega$, 
however, has to be done within the power set of the set that includes all the variables in the 
system, so the number of candidate $\Omega$ sets grows exponentially with the number of variables. 
Since for a large system the search becomes computationally intractable, here we restrict 
the analysis to the study of pairwise dependencies, that is, sets $V$ with just two elements. 
Moreover, we search for explanatory models that attempt to reproduce all the statistical 
structure in $V$ by means of pairwise interactions with a third variable, as in Eq. (19). A 
similar approach, but within a different theoretical framework, has been proved useful in 
disambiguating couplings in oscillatory systems [35]. We define the amount of irreducible 
interaction between the variables $X_i$ and $X_j$ as the amount of statistical dependencies that 
remain unexplained by the optimal minimal model, that is, 

$$\Delta^{ij} = \min\Big\{\Delta^{ij}_{i,j},\ \min_k \{\Delta^{ij}_{ik,kj}\}\Big\}$$
$$= \min\Big\{I_{ij},\ \min_k \{\Delta^{ij}_{ik,kj}\}\Big\} \tag{C4}$$
$$= \min\Big\{I_{ij},\ \min_k \{H_{ik,kj} - H_{ij,jk,ki}\}\Big\}.$$
The index $k$ ranges through all the variables that do not coincide with $i$ or $j$ ($k \ne i$, $k \ne j$). 
By defining $\Delta^{ij}$ as a Kullback-Leibler divergence, its non-negativity is ensured. Besides, the 
minimization in Eq. (C4) ensures that $\Delta^{ij}$ is upper bounded by the mutual information, 
that is, $\Delta^{ij} \le I_{ij}$. Expanding $\Delta^{ij}_{ik,kj}$, 

$$\Delta^{ij}_{ik,kj} = H_{ik} + H_{kj} - H_k - H_{ij,jk,ki}$$
$$= H_{ik} + H_{jk} - H_k - H_{ijk} + H_{ijk} - H_{ij,jk,ki}$$
$$= I_{ij|k} - D^{(3)}_{ijk}. \tag{C5}$$

Therefore, if there are no triple interactions within the whole set of variables, then $\Delta^{ij}$ 
corresponds to conditioning the mutual information between $i$ and $j$ on every other possible 
variable $k$, and looking for the minimum. We can rewrite Eq. (C4) as 

$$\Delta^{ij} = I_{ij} - \Theta\big(\max_k \{I_{ijk} + D^{(3)}_{ijk}\}\big)$$
$$= I_{ij} - \Theta\big(\max_k \{I_{ij} + I_{jk} + I_{ki} - D^{(2)}_{ijk}\}\big), \tag{C6}$$

where $\Theta(x)$ is the Heaviside step function. In this sense, we are looking for a triplet that 
has maximal redundancy, understanding redundancy as 1 — 
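
For the special case in which all triple informations $D^{(3)}_{ijk}$ vanish, Eqs. (C4) and (C5) reduce to $\Delta^{ij} = \min\{I_{ij}, \min_k I_{ij|k}\}$. The following minimal Python sketch (assuming NumPy; the function names are hypothetical, and the restriction to vanishing triple information is an assumption of the example, not of the general definition) evaluates this expression for a joint probability array with one axis per variable. The example is a Markov chain $X_0 \to X_1 \to X_2$ with an arbitrary flip probability of 0.1, for which the pairwise approximation is exact, so the dependence between $X_0$ and $X_2$ is entirely mediated by $X_1$.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits of an array of probabilities."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def marginal(p, keep):
    """Marginal of the joint array p over the axes listed in keep."""
    drop = tuple(ax for ax in range(p.ndim) if ax not in keep)
    return p.sum(axis=drop) if drop else p

def mutual_information(p, i, j):
    """I_ij = H_i + H_j - H_ij, in bits."""
    return (entropy(marginal(p, (i,))) + entropy(marginal(p, (j,)))
            - entropy(marginal(p, (i, j))))

def conditional_mutual_information(p, i, j, k):
    """I_ij|k = H_ik + H_jk - H_k - H_ijk, in bits."""
    return (entropy(marginal(p, (i, k))) + entropy(marginal(p, (j, k)))
            - entropy(marginal(p, (k,))) - entropy(marginal(p, (i, j, k))))

def irreducible_no_triple(p, i, j):
    """min{I_ij, min_k I_ij|k}: Eq. (C4) when every D3_ijk is zero."""
    others = [k for k in range(p.ndim) if k not in (i, j)]
    candidates = [mutual_information(p, i, j)]
    candidates += [conditional_mutual_information(p, i, j, k) for k in others]
    return min(candidates)

# Markov chain X0 -> X1 -> X2: each step copies the previous bit, flipping it
# with probability 0.1 (an arbitrary value chosen for the example).
T = np.array([[0.9, 0.1], [0.1, 0.9]])
p_chain = 0.5 * T[:, :, None] * T[None, :, :]

mediated = irreducible_no_triple(p_chain, 0, 2)  # explained by X1
direct = irreducible_no_triple(p_chain, 0, 1)    # not explained by X2
```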


Appendix D: Example of marginalization effects 


Consider four binary variables $X_i = \pm 1$, which can be thought of as spins, with only 
pairwise interactions between $X_4$ and each of the other three variables. The fourth variable 
is in the up state with probability $(1 + e^{-2\beta})^{-1}$. Here we focus on negative $\beta$ values, which 
favor the down state. The joint probability can be written as a log-linear model [l7, li| 


$$\log p(x_1, x_2, x_3, x_4) = \beta x_4 + x_1 x_4 + x_2 x_4 + x_3 x_4 - \psi \tag{D1}$$
$$= (\beta + x_1 + x_2 + x_3)\, x_4 - \psi,$$

where $\beta < 0$ is the field acting on $X_4$, and $\psi$ is the normalization constant. Marginalizing 
over $X_4$, we obtain 

$$p(x_1, x_2, x_3) = \frac{\cosh(\beta + x_1 + x_2 + x_3)}{\sum_{x'_1, x'_2, x'_3} \cosh(\beta + x'_1 + x'_2 + x'_3)}. \tag{D2}$$

With this probability we are able to calculate the interactions $\Delta_{123}$, $D^{(2)}_{123}$ and $D^{(3)}_{123}$ as a 
function of $\beta$. 



FIG. 8. Interactions $\Delta_{123}$, $D^{(2)}_{123}$ and $D^{(3)}_{123}$ as a function of the field $\beta$ acting on $X_4$. 

In Fig. 8 we see the multi-information $\Delta_{123}$, the amount of pairwise interactions in the 
triplet $D^{(2)}_{123}$, and the triple information $D^{(3)}_{123}$ as a function of the field $\beta$ acting on $X_4$. As 
stated above, $\Delta_{123} = D^{(2)}_{123} + D^{(3)}_{123}$. All of these quantities are obtained from the marginal 
probabilities $p(x_1, x_2, x_3)$ given by Eq. (D2) (see Appendix B). When the field is strong 
($\beta \to -\infty$) the total amount of interaction vanishes, as all spins align in the down state. 
For small values of the field, the amount of interactions is large, and can be explained almost 
entirely by pairwise dependencies. However, for intermediate values of the field (see inset of 
Figure 8), which corresponds to the fourth spin aligned downwards most of the time, the 
triple information is crucial to understand the structure of dependencies within the group of 
remaining variables. In this paper we argue that in the case of written language, the topics 
or latent variables that affect the occurrence of words are likely to present the same kind of 
behavior, that is, they tend to be inactive most of the time. And when they are active, they 
tend to favor the occurrence of specific groups of words. 
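
The three quantities of this appendix can be evaluated with a short numerical sketch. The Python code below is a minimal illustration, assuming NumPy; all function names are hypothetical, and the maximum entropy model $p^{(2)}$ is obtained by bisection on the parameter $\delta$ of Appendix B rather than by solving the cubic equation explicitly. It builds the marginal of Eq. (D2) for a given $\beta$ and returns $\Delta_{123}$, $D^{(2)}_{123}$ and $D^{(3)}_{123}$ in bits.

```python
import numpy as np

# Index 0 stands for x = -1 and index 1 for x = +1.
SPIN = np.array([-1.0, 1.0])
SIGN = SPIN[:, None, None] * SPIN[None, :, None] * SPIN[None, None, :]
TOTAL = SPIN[:, None, None] + SPIN[None, :, None] + SPIN[None, None, :]

def marginal_three_spins(beta):
    """p(x1, x2, x3) proportional to cosh(beta + x1 + x2 + x3), Eq. (D2)."""
    p = np.cosh(beta + TOTAL)
    return p / p.sum()

def max_entropy_pairwise(p, n_iter=200):
    """Bisection on delta so that p2 = p - delta * SIGN has no triple term."""
    lo, hi = -p[SIGN < 0].min(), p[SIGN > 0].min()
    for _ in range(n_iter):
        mid = 0.5 * (lo + hi)
        if (SIGN * np.log(p - mid * SIGN)).sum() > 0:
            lo = mid
        else:
            hi = mid
    return p - 0.5 * (lo + hi) * SIGN

def kl_bits(p, q):
    """Kullback-Leibler divergence D[p : q] in bits (all entries positive)."""
    return float((p * np.log2(p / q)).sum())

def interactions(beta):
    """Return (Delta_123, D2_123, D3_123) in bits for a given field beta."""
    p = marginal_three_spins(beta)
    p2 = max_entropy_pairwise(p)
    p1 = (p.sum(axis=(1, 2))[:, None, None]
          * p.sum(axis=(0, 2))[None, :, None]
          * p.sum(axis=(0, 1))[None, None, :])  # product of univariate marginals
    return kl_bits(p, p1), kl_bits(p2, p1), kl_bits(p, p2)

# Scan a few (arbitrarily chosen) negative values of the field.
scan = {beta: interactions(beta) for beta in (-3.0, -1.5, -0.5)}
```

Because $p^{(2)}$ shares the bivariate marginals of $p$, the first returned value equals the sum of the other two up to numerical precision, which offers a simple consistency check of the bisection.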

Appendix E: Significance test 

We want to assess whether a probability distribution of three variables $p(\mathbf{x})$ is explained 
or not by the simpler maximum entropy model $p^{(2)}(\mathbf{x})$, obtained after measuring only the 
pairwise marginal probabilities. That is, taking the maximum entropy model as the null 
hypothesis $H_0$, and considering as the alternative hypothesis $H_1$ the one in which there is a 
triple dependency, we want to calculate the plausibility of the distribution $p(\mathbf{x})$. In statistics 
a usual way of comparing two models, one of which is nested within the other, is a likelihood 
ratio test. 

If we take $N$ samples, then the likelihood ratio $\Lambda$ is given by 

$$\Lambda = \frac{P(\mathbf{x}_1, \ldots, \mathbf{x}_N | H_1)}{P(\mathbf{x}_1, \ldots, \mathbf{x}_N | H_0)} = \frac{\prod_{i=1}^{N} p(\mathbf{x}_i)}{\prod_{i=1}^{N} p^{(2)}(\mathbf{x}_i)}. \tag{E1}$$

Considering $N \to \infty$ and using Sanov's theorem [bij ], it follows 

$$\log(\Lambda) = N D[p : p^{(2)}]. \tag{E2}$$

In addition, the result by Wilks [36] implies that, neglecting terms of order $N^{-1/2}$, 

$$2 \log(\Lambda) = \chi^2_d, \tag{E3}$$

that is, twice the logarithm of the likelihood ratio tends to a chi-square distribution, where the 
number of degrees of freedom $d$ equals the difference in the numbers of parameters between the 
models. Combining these two results, we conclude that under the null hypothesis, 

$$D[p : p^{(2)}] = \frac{\chi^2_1}{2N}, \tag{E4}$$

where the chi-square distribution has one degree of freedom. Taking a significance of $\alpha = 
0.1\%$ and $N = 512$, we reject the null hypothesis if $D[p : p^{(2)}] > 0.01$ bits. 
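
The order of magnitude of this threshold can be reproduced with the Python standard library. The sketch below is a minimal illustration under stated assumptions: the function name `divergence_threshold` is hypothetical, Eq. (E4) is applied with the divergence measured in nats, and the quantile of the chi-square distribution with one degree of freedom is obtained as the square of a standard normal quantile.

```python
import math
from statistics import NormalDist

def divergence_threshold(alpha, n_samples):
    """Rejection threshold for D[p : p2] from a chi-square law with 1 dof.

    A chi-square variable with one degree of freedom is the square of a
    standard normal variable, so its (1 - alpha) quantile is the square of
    the (1 - alpha/2) quantile of the normal distribution.
    Returns the threshold as a pair (nats, bits).
    """
    chi2_quantile = NormalDist().inv_cdf(1.0 - alpha / 2.0) ** 2
    d_nats = chi2_quantile / (2.0 * n_samples)
    return d_nats, d_nats / math.log(2.0)

d_nats, d_bits = divergence_threshold(alpha=0.001, n_samples=512)
```

Both values are of the order of 0.01, the figure quoted above; the exact number depends on the unit of information adopted.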

An analogous analysis is done when evaluating the significance of $D[p_{ij,ik,jk} : p_{ik,jk}]$, with 
the same result. 


Appendix F: Error estimation 


The estimation of the error of our measures is done by a Bayesian approach [32]. 
Estimation problems are dominated by finite sampling in the probabilities of the different 
states. 

On the one side, we have the true probability $\mathbf{q}$ governing the outcome of the experiment, 
whose coordinates refer to the $S$ possible states of the system (in our case to the eight states 
for three binary variables). On the other side, there is the frequency count $f_i = n_i/N$, where 
$n_i$ is the number of times the state $i$ occurs, and $N$ is the total number of measurements. 
The probability of measuring $\mathbf{f}$ given that the data are governed by $\mathbf{q}$ is the multinomial 
probability 

$$P(\mathbf{f}|\mathbf{q}) = N! \prod_i \frac{q_i^{n_i}}{n_i!}. \tag{F1}$$


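We have no access to $\mathbf{q}$; we can only measure $\mathbf{f}$. We therefore need the probability that 
the true distribution be $\mathbf{q}$ given that $\mathbf{f}$ was measured, that is, the probability density $P(\mathbf{q}|\mathbf{f})$. 
Through Bayes' rule, 

$$P(\mathbf{q}|\mathbf{f}) = \frac{P(\mathbf{f}|\mathbf{q})\, P(\mathbf{q})}{P(\mathbf{f})} = \frac{\exp(-N D[\mathbf{f} : \mathbf{q}])\, P(\mathbf{q})}{Z}, \tag{F2}$$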


where P(q) is the prior probability distribution for q, and Z is the normalization over the 
domain of q. For the estimation of the error, and in the limit of a large number of samples, 
the result does not depend on the choice of the prior, as we show below. 

If we need to estimate some function $W(\mathbf{q})$ of the probabilities, the variance of the 
estimate is 

$$\sigma_W^2 = \langle W^2 \rangle - \langle W \rangle^2, \tag{F3}$$

where the average is over $P(\mathbf{q}|\mathbf{f})$. In our case, we are interested in the triple information 
$W(\mathbf{q}) = D[\mathbf{q} : \mathbf{q}^{(2)}]$, where $\mathbf{q}^{(2)}$ is the maximum entropy probability compatible with the 
second-order marginals. 

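From [32] it follows that, in the limit $N \gg S$ and to first order in $1/N$, 

$$\sigma_W^2 = \sum_i \left(\frac{\partial W}{\partial q_i}\right)^2 \frac{f_i (1 - f_i)}{N} - 2 \sum_i \sum_{j<i} \frac{\partial W}{\partial q_i} \frac{\partial W}{\partial q_j} \frac{f_i f_j}{N} + O(N^{-2}) \tag{F4}$$
$$= \nabla_q W^{T} \cdot \Sigma \cdot \nabla_q W,$$

where the covariance matrix of the probabilities $\Sigma$ is 

$$\Sigma_{ij} = \begin{cases} \dfrac{f_i (1 - f_i)}{N} & \text{if } i = j \\[2mm] -\dfrac{f_i f_j}{N} & \text{if } i \ne j. \end{cases} \tag{F5}$$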


Due to finite sampling, the frequencies $f_i$ may fluctuate. From Eq. (F4) we see that we only 
need the covariance matrix and the gradient of $W(\mathbf{q})$ evaluated at $\mathbf{f}$ in order to transform 
the variance of the vector $\mathbf{f}$ along different directions of the simplex into variance in $W$. It 
is important to notice that the error in $W$ is of order $1/\sqrt{N}$, which means that if we want 
to reduce the error by half, we need to increase the number of samples fourfold. 

In our case the gradient $\nabla_q W$ is difficult to calculate, but we can obtain the result from 
Eq. (F4) numerically. Given the frequency $\mathbf{f}$, first we calculate the eigenvalues and 
eigenvectors from the covariance matrix $\Sigma$ given by Eq. (F5). One non-degenerate eigenvector 
is orthogonal to the simplex, and has a zero eigenvalue. The remaining eigenvectors $\mathbf{v}_k$ 
belong to the simplex and all have positive eigenvalues $\sigma_k^2$, equal to the variances in the 
corresponding directions. Finally, making a small change $\varepsilon$ in the frequencies along these 
directions, we obtain the change $\Delta W_k = W(\mathbf{f} + \varepsilon \mathbf{v}_k) - W(\mathbf{f})$, so that 

$$\sigma_W^2 = \sum_{k=1}^{S-1} \left(\frac{\Delta W_k}{\varepsilon}\right)^2 \sigma_k^2, \tag{F6}$$

where every $\sigma_k^2$ is in the order of $1/N$. 
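
The numerical procedure just described can be sketched in a few lines of Python. The code below is a minimal illustration, assuming NumPy; the function name `propagated_std`, the step `eps` and the example frequencies are hypothetical choices, and the Shannon entropy is used as the function $W$ only to keep the example short (the triple information would be handled in the same way). It diagonalizes the covariance matrix of Eq. (F5), perturbs the frequencies along each eigenvector inside the simplex, and adds the resulting variances.

```python
import numpy as np

def propagated_std(f, n_samples, func, eps=1e-6):
    """Standard deviation of func(f) induced by multinomial sampling noise.

    Builds the covariance matrix of Eq. (F5), diagonalizes it, and combines
    the finite-difference slopes of func along the eigenvectors.
    """
    f = np.asarray(f, dtype=float)
    cov = (np.diag(f) - np.outer(f, f)) / n_samples
    variances, vectors = np.linalg.eigh(cov)
    w0 = func(f)
    total = 0.0
    for var_k, v_k in zip(variances, vectors.T):
        if var_k <= 1e-12 * variances.max():
            continue  # direction orthogonal to the simplex (zero eigenvalue)
        slope = (func(f + eps * v_k) - w0) / eps
        total += var_k * slope ** 2
    return float(np.sqrt(total))

def entropy_bits(q):
    """Shannon entropy in bits; stands in for the function W of the text."""
    q = np.asarray(q, dtype=float)
    return float(-(q * np.log2(q)).sum())

# Example frequencies for the eight states of three binary variables.
f_example = [0.1, 0.2, 0.05, 0.15, 0.1, 0.1, 0.2, 0.1]
std_512 = propagated_std(f_example, 512, entropy_bits)
std_2048 = propagated_std(f_example, 2048, entropy_bits)  # four times the samples
```

Since every eigenvalue scales as $1/N$, quadrupling the number of samples halves the estimated standard deviation, in line with the $1/\sqrt{N}$ behavior noted above.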



FIG. 9. Standard deviation of the triple information as a function of $D^{(3)}_{ijk}$, for the triplets that 
satisfy $D^{(3)}_{ijk} > 0.01$. A: OS. B: AM. The dashed line indicates the identity. 

Figure 9 shows the standard deviation of $D^{(3)}_{ijk}$ obtained by this method as a function of 
$D^{(3)}_{ijk}$ for the triplets that satisfy $D^{(3)}_{ijk} > 0.01$, for both books. The error lies between 0.005 
bits and 0.01 bits. 